GPT-2: Language Models are Unsupervised Multitask Learners
This document summarizes a technical paper about GPT-2, an unsupervised language model created by OpenAI. GPT-2 is a transformer-based model trained on a large corpus of internet text using byte-pair encoding. The paper describes experiments showing GPT-2 can perform various NLP tasks like summarization, translation, and question answering with limited or no supervision, though performance is still below supervised models. It concludes that unsupervised task learning is a promising area for further research.
Language Models are Unsupervised Multitask Learners (GPT-2)
OpenAI
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever
2019.03.03
Presented by Young Seok Kim
PR-145
Related Papers
• Vaswani, Ashish et al. “Attention Is All You Need.” NIPS (2017)
• PR-049: https://youtu.be/6zGgVIlStXs
• Tutorial with code: http://nlp.seas.harvard.edu/2018/04/03/attention.html
• Radford, Alec. “Improving Language Understanding by Generative Pre-Training.” (2018)
• Website: https://blog.openai.com/language-unsupervised/
• Paper: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf
• Devlin, Jacob et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” (2018)
• Website: https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html
• Paper: https://arxiv.org/abs/1810.04805
• PR-121: https://youtu.be/GK4IO3qOnLc
Common Crawl?
• Significant data quality issues
• Best results were achieved when using a small subsample of Common Crawl which included only documents most similar to the target dataset
• Authors of GPT-2 wanted to avoid making
assumptions about the tasks to be performed
ahead of time.
WebText
• GPT-2 authors created a new web scrape which emphasizes document quality
• They scraped web pages which have been curated/
filtered by humans
• Manually filtering a full web scrape would be
exceptionally expensive
• Scraped all outbound links from Reddit, which
received at least 3 karma
• Heuristic indicator for whether other users found the
link interesting / educational / or just funny
WebText
• 45 million links
• Used content extractors to extract the text from HTML
• De-duplication
• Heuristic-based cleaning
• Slightly over 8 million documents
• 40 GB of text
• Removed ALL Wikipedia documents
• since it is a common data source for other datasets and could complicate analysis due to overlapping training data with test evaluation tasks
Byte Pair Encoding (BPE)
• Sennrich, Rico et al. “Neural Machine Translation of Rare Words with Subword Units.” (2016)
• Practical middle ground between character level and word level language modeling
• Effectively interpolates between word level inputs for frequent symbol sequences and
character level inputs for infrequent symbol sequences
• Combined empirical benefits of word-level LMs with the generality of byte-level
approaches
• This approach can assign a probability to any Unicode string, regardless of pre-processing, tokenization or vocabulary size
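To make the merge idea concrete, here is a minimal sketch of the core BPE training loop. This is a toy word-level illustration, not GPT-2's actual byte-level implementation (which works on raw bytes and restricts merges across character categories):

```python
from collections import Counter

def bpe_train(corpus_words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair.
    corpus_words: dict mapping a word (tuple of symbols) to its frequency."""
    vocab = dict(corpus_words)
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair gets merged
        merges.append(best)
        merged_vocab = {}
        for word, freq in vocab.items():
            # Replace every occurrence of the best pair with one merged symbol
            new_word, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    new_word.append(word[i] + word[i + 1])
                    i += 2
                else:
                    new_word.append(word[i])
                    i += 1
            merged_vocab[tuple(new_word)] = freq
        vocab = merged_vocab
    return merges

# Frequent sequences collapse into single tokens; rare ones stay as characters/bytes.
merges = bpe_train({("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
                    ("n", "e", "w", "e", "s", "t"): 6}, num_merges=4)
print(merges)
```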
Transformer
• Transformer-based
• Follows the details of GPT-1
• Layer Normalization was moved to the input of each sub-block (similar to pre-activation in ResNet); see the sketch below
• An additional LayerNorm was added after the final self-attention block
• Vocab is expanded to 50,257
• Batch size of 512 is used
(Figure: original Transformer architecture, shown for comparison)
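A minimal PyTorch sketch of the pre-norm block described above; module names and dimensions are illustrative and not taken from the released GPT-2 code:

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """GPT-2-style block: LayerNorm is applied to the *input* of each sub-block,
    and the residual connection bypasses the normalization (pre-activation style)."""
    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))

    def forward(self, x, attn_mask=None):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + a                      # residual around (masked) self-attention
        x = x + self.mlp(self.ln2(x))  # residual around feed-forward
        return x

# GPT-2 additionally applies one more LayerNorm after the final block.
```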
Children’s Book Test
• Hill, Felix et al. “The Goldilocks Principle: Reading Children's Books with Explicit Memory Representations.” (2016)
• Reports accuracy on automatically constructed cloze test where the task is to predict
which of 10 possible choices for an omitted word is correct.
• GPT-2 authors compute the probability of each choice and the rest of the sentence conditioned on this choice according to the LM, and predict the choice with the highest probability (a sketch of this scoring appears below)
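A rough sketch of that scoring procedure, assuming the Hugging Face `transformers` GPT-2 as the language model (tooling not mentioned in the slides) and CBT's `XXXXX` placeholder for the omitted word:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def cloze_predict(context, sentence_template, choices):
    """Score each candidate by the LM probability of the choice *and* the rest
    of the sentence conditioned on it; return the highest-scoring choice."""
    scores = []
    for choice in choices:
        text = context + " " + sentence_template.replace("XXXXX", choice)
        ids = tokenizer(text, return_tensors="pt").input_ids
        with torch.no_grad():
            out = model(ids, labels=ids)
        # out.loss is the mean per-token negative log-likelihood; scale it back
        # to a total log-likelihood so longer/shorter choices compare fairly
        scores.append(-out.loss.item() * (ids.size(1) - 1))
    return choices[max(range(len(choices)), key=lambda i: scores[i])]
```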
LAMBADA
• LAnguage Modeling Broadened to Account for Discourse Aspects
• Paperno, Denis et al. “The LAMBADA dataset: Word prediction requiring a broad
discourse context.” (2016)
• Task is to predict the final word of sentences which require at least 50 tokens of
context for a human to successfully predict
• Perplexity improves from 99.8 (previous SOTA) -> 8.63 with GPT-2
Winograd Schema Challenge
• Measures commonsense reasoning via the model's ability to resolve ambiguities in text
Summarization
• Appended the text “TL;DR:” after the article and generated 100 tokens with top-k random sampling with k=2 (see the sketch below)
• CNN and Daily Mail dataset
• Used the first 3 generated sentences in these 100 tokens as the summary
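A sketch of the TL;DR prompting setup, again assuming the Hugging Face `transformers` interface; the article text and the crude sentence splitting are placeholders:

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

article = "..."                       # full news article text goes here
prompt = article + "\nTL;DR:"         # the hint that induces summarization
ids = tokenizer(prompt, return_tensors="pt").input_ids

out = model.generate(
    ids,
    do_sample=True, top_k=2,          # top-k random sampling with k=2
    max_new_tokens=100,               # generate 100 tokens
    pad_token_id=tokenizer.eos_token_id,
)
generated = tokenizer.decode(out[0, ids.size(1):])
# Keep the first three generated sentences as the summary (crude split)
summary = " ".join(generated.split(". ")[:3])
print(summary)
```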
Translation
• ‘english sentence = french sentence’ format
• Generate text after ‘english sentence = ’ (see the sketch below)
• Sample from the model with greedy decoding and use the first generated sentence as the translation
• GPT-2 gets 5 BLEU on the WMT-14 English-French test set
• GPT-2 gets 11.5 BLEU on the WMT-14 French-English test set
• Outperforms several unsupervised machine translation baselines (2017)
• But still much worse than the 33.5 BLEU of the current SOTA in unsupervised machine translation (2019)
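A sketch of that prompting format with greedy decoding, assuming the Hugging Face GPT-2 interface; the example sentence pairs are made up for illustration:

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Condition on example pairs in the 'english sentence = french sentence' format,
# then prompt with the sentence to translate followed by ' = '.
prompt = (
    "the cat sat on the mat = le chat s'est assis sur le tapis\n"
    "i would like a coffee = je voudrais un café\n"
    "where is the train station = "
)
ids = tokenizer(prompt, return_tensors="pt").input_ids
out = model.generate(ids, do_sample=False, max_new_tokens=40,  # greedy decoding
                     pad_token_id=tokenizer.eos_token_id)
completion = tokenizer.decode(out[0, ids.size(1):])
translation = completion.split("\n")[0]  # use the first generated sentence/line
print(translation)
```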
Translation
• Surprising result!
• Authors of GPT-2 deliberately removed non-English webpages from WebText as a filtering step
• Authors ran a byte-level language detector on WebText
• Only 10MB of data in the French language
• (Approximately 500x smaller than the monolingual French corpus common in prior unsupervised machine translation research)
Question Answering
• GPT-2 answers 4.1% of questions correctly when evaluated by the exact match metric commonly used on reading comprehension datasets like SQuAD (sketched below)
• The smallest model does not exceed the 1.0% accuracy of an incredibly simple baseline which returns the most common answer for each question type (who, what, where, etc…)
• -> Model capacity is important
• But GPT-2 has an accuracy of 63.1% on the 1% of questions it is most confident in
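For reference, the exact match metric amounts to normalized string equality; a rough SQuAD-style sketch (my own, not code from the paper):

```python
import re
import string

def normalize(text):
    """Lowercase, strip articles, punctuation and extra whitespace (SQuAD-style)."""
    text = text.lower()
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())

def exact_match(prediction, gold_answers):
    """1 if the normalized prediction equals any normalized gold answer, else 0."""
    return int(any(normalize(prediction) == normalize(g) for g in gold_answers))

print(exact_match("The Eiffel Tower", ["Eiffel Tower"]))  # -> 1
```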
Generalization vs Memorization
• It is important to analyze how much test data also shows up in the training data
• Using Bloom filters, the authors measured what percentage of each (test) dataset is also found in the WebText training set (see the sketch below)
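A sketch of such an overlap check using a simple Bloom filter over 8-grams; the filter size, hashing, and n-gram normalization here are illustrative assumptions, not the paper's exact setup:

```python
import hashlib

class BloomFilter:
    """Compact probabilistic set membership: false positives possible, no false negatives."""
    def __init__(self, size_bits=1 << 24, num_hashes=5):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.num_hashes):
            h = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def ngrams(text, n=8):
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

# Build the filter over training-set 8-grams, then measure what fraction of a
# test document's 8-grams are (probably) present in the training data.
bf = BloomFilter()
for gram in ngrams("placeholder webtext training document ..."):
    bf.add(gram)

test_grams = ngrams("placeholder test-set document ...")
overlap = sum(g in bf for g in test_grams) / max(len(test_grams), 1)
print(f"{overlap:.1%} of test 8-grams appear in training data")
```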
Conclusions
• Unsupervised task learning is an additional promising area of research to explore
• Performance of GPT-2 is competitive with supervised baselines in a zero-shot setting.
• on reading comprehension
• but not on other tasks like summarization, etc…
• Studied zero-shot performance of WebText LMs on many canonical NLP tasks
Personal Thoughts
• Rather than focusing on a novel model architecture, the paper focuses on unsupervised task learning, evaluating / analyzing on various canonical datasets / tasks
• Compared to the hype, the model's actual results are relatively modest
• Scaling is important. Modern research at huge companies has already transitioned to huge models
• Zero-shot learning is interesting
How do you think about OpenAI not releasing the model?
(Is it ethical for OpenAI to keep the big model private?)
• Propagates fear
• Reproducibility issues
• Creates unnecessary hype
• May be used for malicious purposes such as
• Generate misleading news articles
• Automate the production of abusive or faked
content to post on social media
• Automate the production of spam/phishing
content