Language Models are Unsupervised Multitask Learners (GPT-2)
OpenAI
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever
2019.03.03
Presented by Young Seok Kim
PR-145
Articles & Useful Links
• Official

• Technical Paper: https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf

• Blog: https://blog.openai.com/better-language-models/

• GitHub: https://github.com/openai/gpt-2

• Unofficial

• Reddit: https://www.reddit.com/r/MachineLearning/comments/aqlzde/r_openai_better_language_models_and_their/
Related Papers
• Vaswani, Ashish et al. “Attention Is All You Need.” NIPS (2017)

• PR-049: https://youtu.be/6zGgVIlStXs

• Tutorial with code: http://nlp.seas.harvard.edu/2018/04/03/attention.html 

• Radford, Alec. “Improving Language Understanding by Generative Pre-Training.” (2018)

• Website: https://blog.openai.com/language-unsupervised/

• Paper: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

• Devlin, Jacob et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” (2018)

• Website: https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html

• Paper: https://arxiv.org/abs/1810.04805

• PR-121: https://youtu.be/GK4IO3qOnLc
Dataset

Dataset (BERT)

• BookCorpus (800M words) + Wikipedia (2500M words)

Common Crawl?
• Significant data quality issues.

• Best results were achieved when using a small subsample of Common Crawl which included only documents most similar to the target dataset

• Authors of GPT-2 wanted to avoid making
assumptions about the tasks to be performed
ahead of time.
WebText
• GPT-2 authors created a new web scrape which
emphasizes document quality

• They scraped web pages which have been curated/filtered by humans

• Manually filtering a full web scrape would be
exceptionally expensive

• Scraped all outbound links from Reddit, which
received at least 3 karma

• Heuristic indicator for whether other users found the
link interesting / educational / or just funny
WebText
• 45 million links

• Used content extractors to extract the text from HTML

• De-duplication

• heuristic based cleaning

• slightly over 8 million documents

• 40 GB of text

• Removed ALL Wikipedia documents

• since it is a common data source for other datasets and could complicate analysis due to overlapping training data with test evaluation tasks
Input Representation
Byte Pair Encoding (BPE)
• Sennrich, Rico et al. “Neural Machine Translation of Rare Words with Subword Units.” (2016)

• Practical middle ground between character level and word level language modeling

• Effectively interpolates between word level inputs for frequent symbol sequences and
character level inputs for infrequent symbol sequences

• Combined empirical benefits of word-level LMs with the generality of byte-level
approaches

• This approach can assign a probability to any Unicode string, regardless of pre-processing, tokenization or vocabulary size
Byte Pair Encoding (BPE)
Sennrich, Rico et al. “Neural Machine Translation of Rare Words with Subword Units.” (2016)
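To make the merge procedure on this slide concrete, below is a toy sketch of BPE merge learning in the spirit of Sennrich et al. The function names and the tiny example vocabulary are illustrative assumptions; GPT-2's actual tokenizer applies the same idea over raw bytes rather than characters.

```python
# Toy sketch of BPE merge learning (character-level, after Sennrich et al.).
# GPT-2 applies the same idea over raw bytes; this simplified version works on
# space-separated symbols just to show the merge loop.
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every adjacent occurrence of `pair` into a single symbol."""
    merged = {}
    for word, freq in vocab.items():
        symbols, out, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[" ".join(out)] = freq
    return merged

def learn_bpe(vocab, num_merges):
    """Greedily merge the most frequent pair, num_merges times."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# Frequent words collapse into whole-word symbols; rare words stay as subwords.
corpus = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges, final_vocab = learn_bpe(corpus, num_merges=10)
```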
Model
Transformer
• Transformer-based 

• Follows the details of GPT-1

• Layer Normalization was moved to the input of each sub-block 

(similar to pre-activation in ResNet)

• Additional LayerNorm was added after the final self-attention
block.

• Vocabulary is expanded to 50,257

• A batch size of 512 is used

[Figure: the original Transformer block, shown for comparison]
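As a rough illustration of the pre-LayerNorm placement described on this slide (LayerNorm moved to the input of each sub-block, as in pre-activation ResNet), here is a minimal PyTorch sketch of one residual block. The class name, layer sizes, and use of nn.MultiheadAttention are illustrative choices of mine; the official GPT-2 code is a TensorFlow implementation and additionally applies a causal mask and a final LayerNorm after the last block.

```python
# Minimal sketch of a pre-LayerNorm residual block in the GPT-2 style:
# LayerNorm is applied to the *input* of each sub-block, and the sub-block
# output is added back to the residual stream.
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    def __init__(self, d_model: int = 768, n_head: int = 12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor, attn_mask: torch.Tensor = None) -> torch.Tensor:
        h = self.ln1(x)                                   # normalize before attention
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + a                                         # residual connection
        x = x + self.mlp(self.ln2(x))                     # normalize before the MLP
        return x

# Usage: x has shape (batch, sequence, d_model)
block = PreLNBlock()
x = torch.randn(2, 16, 768)
y = block(x)
```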
Experiments
Model sizes

Parameters   Layers   d_model
117M         12       768      (≈ GPT-1)
345M         24       1024     (≈ BERT-large)
762M         36       1280
1542M        48       1600     (GPT-2)
Zero-shot results
Children’s Book Test
• Hill, Felix et al. “The Goldilocks Principle: Reading Children's Books with Explicit
Memory Representations.” (2016)

• Reports accuracy on an automatically constructed cloze test where the task is to predict which of 10 possible choices for an omitted word is correct.

• GPT-2 authors compute the probability of each choice and of the rest of the sentence conditioned on that choice according to the LM, and predict the choice with the highest probability (a toy scoring sketch follows below).
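A minimal sketch of this scoring rule, assuming a hypothetical helper lm_log_prob(text) that returns the model's total log-probability of a string (CBT marks the omitted word as XXXXX):

```python
# Toy sketch of the Children's Book Test scoring described above: fill the blank
# with each of the 10 candidates, score the completed sentence (given the story
# context) under the LM, and predict the highest-probability candidate.
# `lm_log_prob` is a hypothetical stand-in for the language model's scorer.
def predict_cloze(context: str, sentence_with_blank: str, choices: list[str],
                  lm_log_prob) -> str:
    def score(choice: str) -> float:
        filled = sentence_with_blank.replace("XXXXX", choice)  # CBT's blank marker
        return lm_log_prob(context + " " + filled)
    return max(choices, key=score)
```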
LAMBADA
• LAnguage Modeling Broadened to Account for Discourse Aspects

• Paperno, Denis et al. “The LAMBADA dataset: Word prediction requiring a broad
discourse context.” (2016)

• Task is to predict the final word of sentences which require at least 50 tokens of
context for a human to successfully predict

• GPT-2 improves perplexity on LAMBADA from 99.8 PPL to 8.63 PPL
Winograd Schema Challenge
• Measures commonsense reasoning by testing the model's ability to resolve ambiguities (e.g. pronoun references) in text
Winograd Schema Challenge
Trinh, Trieu H. and Quoc V. Le. “A Simple Method for Commonsense Reasoning.” (2018)
Summarization
• Added text “TL;DR:” after the
article and generated 100 tokens
with Top-k random sampling with
k=2

• CNN and Daily Mail dataset

• Used the first 3 sentences among these 100 generated tokens as the summary (the sampling setup is sketched below)
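A minimal sketch of this setup, assuming hypothetical stand-ins logits_fn (next-token logits for a token sequence) and encode/decode (the BPE tokenizer):

```python
# Toy sketch of the zero-shot summarization recipe described above: append
# "TL;DR:" to the article, sample 100 tokens with top-k random sampling (k=2),
# and keep the first 3 generated sentences as the summary.
import numpy as np

def top_k_sample(logits: np.ndarray, k: int = 2) -> int:
    """Sample a token id from the k highest-scoring next tokens."""
    top = np.argsort(logits)[-k:]
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()
    return int(np.random.choice(top, p=probs))

def summarize(article: str, logits_fn, encode, decode, n_tokens: int = 100) -> str:
    tokens = encode(article + "\nTL;DR:")
    prompt_len = len(tokens)
    for _ in range(n_tokens):
        tokens.append(top_k_sample(logits_fn(tokens), k=2))
    generated = decode(tokens[prompt_len:])
    return " ".join(generated.split(". ")[:3])   # first 3 generated sentences
```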
Translation
• ‘english sentence = french sentence’ format

• Generate text after ‘english sentence = ’

• Sample from the model with greedy decoding and use the first generated sentence as the translation

• GPT-2 gets 5 BLEU on WMT-14 English-French test set

• GPT-2 gets 11.5 BLEU on WMT-14 French-English test set

• Outperforms several unsupervised machine translation baselines (2017)

• But still much worse than 33.5 BLEU of the current SOTA of unsupervised machine translation (2019)
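A minimal sketch of this prompting scheme, again with hypothetical stand-ins next_token_fn (the id of the most likely next token), encode, and decode:

```python
# Toy sketch of the zero-shot translation prompt described above: condition on a
# few `english sentence = french sentence` example pairs, then on
# `<new english sentence> =`, decode greedily, and keep the first generated
# sentence as the translation.
def translate_en_fr(sentence: str, example_pairs, next_token_fn, encode, decode,
                    max_tokens: int = 64) -> str:
    prompt = "".join(f"{en} = {fr}\n" for en, fr in example_pairs)
    prompt += f"{sentence} ="
    tokens = encode(prompt)
    generated = []
    for _ in range(max_tokens):
        t = next_token_fn(tokens)        # greedy decoding: most likely next token
        tokens.append(t)
        generated.append(t)
    return decode(generated).split("\n")[0].strip()  # first generated sentence
```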
Translation
• Surprising result!

• Authors of GPT-2 deliberately removed non-English webpages from WebText as a
filtering step

• Authors ran a byte-level language detector on WebText

• Only 10MB of data in the French language

• (Approximately 500x smaller than the monolingual French corpus common in prior
unsupervised machine translation research)
Question Answering
• GPT-2 answers 4.1% of questions correctly when evaluated by the exact match metric commonly used on reading comprehension datasets like SQuAD (sketched below)

• The smallest model does not exceed the 1.0% accuracy of an incredibly simple baseline which returns the most common answer for each question type (who, what, where, etc.)

• → Suggests model capacity is important

• But GPT-2 has an accuracy of 63.1% on the 1% of questions it is most confident in
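For reference, a minimal sketch of the exact-match metric mentioned above, following the normalization used by the standard SQuAD evaluation (lower-case, strip punctuation and articles, collapse whitespace); this is an illustrative reimplementation, not the official script:

```python
# Minimal sketch of SQuAD-style exact match: normalize both strings, then
# compare for equality.
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, ground_truth: str) -> bool:
    return normalize(prediction) == normalize(ground_truth)
```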
Generalization vs Memorization
• It is important to analyze how much of the test data also shows up in the training data

• Using Bloom filters, the authors measured what percentage of each test dataset is also found in the WebText training set (sketched below)
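Below is a toy version of this analysis; the hand-rolled Bloom filter, the 8-gram granularity, and the simple lower-casing are simplifying assumptions for illustration rather than the paper's exact setup.

```python
# Toy sketch of the train/test overlap check described above: insert training-set
# 8-grams into a Bloom filter, then report the fraction of a test set's 8-grams
# that are (probably) also in the training data. False positives are possible,
# false negatives are not.
import hashlib

class BloomFilter:
    def __init__(self, n_bits: int = 1 << 24, n_hashes: int = 4):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item: str):
        for i in range(self.n_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def ngrams(text: str, n: int = 8):
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_percentage(train_docs, test_docs) -> float:
    bf = BloomFilter()
    for doc in train_docs:
        for gram in ngrams(doc):
            bf.add(gram)
    test_grams = set().union(*(ngrams(d) for d in test_docs))
    if not test_grams:
        return 0.0
    return 100.0 * sum(g in bf for g in test_grams) / len(test_grams)
```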
WebText Underfitting

• Both training and test performance on WebText keep improving with model size, i.e. even the largest GPT-2 still underfits WebText
Conclusions
• Unsupervised task learning is an additional promising area of research to explore 

• Performance of GPT-2 is competitive with supervised baselines in a zero-shot setting. 

• on reading comprehension

• but not on other tasks like summarization, etc…

• Studied zero-shot performance of WebText LMs on many canonical NLP tasks
Discussions
Personal Thoughts
• Rather than focusing on a novel model architecture, the paper focuses on unsupervised task learning, evaluating and analyzing on various canonical datasets and tasks

• Compared to the hype, the model's actual results are relatively modest

• Scaling is important. Modern research at large companies has already transitioned to huge models

• Zero-shot learning is interesting
What do you think about OpenAI not releasing the model?
(Is it ethical for OpenAI to keep the big model private?)
• Arguments against withholding:

• Propagates fear

• Reproducibility issues

• Creates unnecessary hype
• Argument for withholding: the model may be used maliciously, e.g. to

• Generate misleading news articles

• Automate the production of abusive or faked content to post on social media

• Automate the production of spam/phishing content
Thank you!