Language Models are Unsupervised Multitask Learners (GPT-2)
OpenAI
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever
2019.03.03
Presented by Young Seok Kim
PR-145
Articles & Useful Links
• Official

• Technical Paper: https://d4mucfpksywv.cloudfront.net/better-language-models/language-models.pdf

• Blog: https://blog.openai.com/better-language-models/

• GitHub: https://github.com/openai/gpt-2

• Unofficial

• Reddit: https://www.reddit.com/r/MachineLearning/comments/aqlzde/r_openai_better_language_models_and_their/
Related Papers
• Vaswani, Ashish et al. “Attention Is All You Need.” NIPS (2017)

• PR-049: https://youtu.be/6zGgVIlStXs

• Tutorial with code: http://nlp.seas.harvard.edu/2018/04/03/attention.html 

• Radford, Alec. “Improving Language Understanding by Generative Pre-Training.” (2018)

• Website: https://blog.openai.com/language-unsupervised/

• Paper: https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

• Devlin, Jacob et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” (2018)

• Website: https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html

• Paper: https://arxiv.org/abs/1810.04805

• PR-121: https://youtu.be/GK4IO3qOnLc
Dataset

Dataset (BERT)

• BookCorpus (800M words) + Wikipedia (2500M words)

Common Crawl?
• Significant data quality issues.

• Best results were achieved when using a small subsample of Common Crawl which included only documents most similar to the target dataset

• Authors of GPT-2 wanted to avoid making
assumptions about the tasks to be performed
ahead of time.
WebText
• GPT-2 authors created a new web scrape which
emphasizes document quality

• They scraped web pages which have been curated/filtered by humans

• Manually filtering a full web scrape would be
exceptionally expensive

• Scraped all outbound links from Reddit, which
received at least 3 karma

• Heuristic indicator for whether other users found the
link interesting / educational / or just funny
WebText
• 45 million links

• Used content extractors to extract the text from HTML

• De-duplication

• heuristic based cleaning

• slightly over 8 million documents

• 40 GB of text

• Removed ALL Wikipedia documents

• since it is a common data source for other datasets and could complicate analysis due to overlapping training data with test evaluation tasks
Input Representation
Byte Pair Encoding (BPE)
• Sennrich, Rico et al. “Neural Machine Translation of Rare Words with Subword Units.” (2016)

• Practical middle ground between character level and word level language modeling

• Effectively interpolates between word level inputs for frequent symbol sequences and
character level inputs for infrequent symbol sequences

• Combined empirical benefits of word-level LMs with the generality of byte-level
approaches

• This approach can assign a probability to any Unicode string, regardless of pre-processing, tokenization or vocabulary size
Byte Pair Encoding (BPE)
Sennrich, Rico et al. “Neural Machine Translation of Rare Words with Subword Units.” (2016)
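To make the merge procedure on this slide concrete, below is a toy sketch of BPE merge learning in the spirit of Sennrich et al. The function names and the tiny example vocabulary are illustrative assumptions; GPT-2's actual tokenizer applies the same idea over raw bytes rather than characters.

```python
# Toy sketch of BPE merge learning (character-level, after Sennrich et al.).
# GPT-2 applies the same idea over raw bytes; this simplified version works on
# space-separated symbols just to show the merge loop.
from collections import Counter

def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(pair, vocab):
    """Merge every adjacent occurrence of `pair` into a single symbol."""
    merged = {}
    for word, freq in vocab.items():
        symbols, out, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[" ".join(out)] = freq
    return merged

def learn_bpe(vocab, num_merges):
    """Greedily merge the most frequent pair, num_merges times."""
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab

# Frequent words collapse into whole-word symbols; rare words stay as subwords.
corpus = {"l o w </w>": 5, "l o w e r </w>": 2, "n e w e s t </w>": 6, "w i d e s t </w>": 3}
merges, final_vocab = learn_bpe(corpus, num_merges=10)
```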
Model
Transformer
• Transformer-based 

• Follows the details of GPT-1

• Layer Normalization was moved to the input of each sub-block 

(similar to pre-activation in ResNet)

• Additional LayerNorm was added after the final self-attention
block.

• Vocabulary is expanded to 50,257

• A batch size of 512 is used

[Figure: the original Transformer block, shown for comparison]
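As a rough illustration of the pre-LayerNorm placement described on this slide (LayerNorm moved to the input of each sub-block, as in pre-activation ResNet), here is a minimal PyTorch sketch of one residual block. The class name, layer sizes, and use of nn.MultiheadAttention are illustrative choices of mine; the official GPT-2 code is a TensorFlow implementation and additionally applies a causal mask and a final LayerNorm after the last block.

```python
# Minimal sketch of a pre-LayerNorm residual block in the GPT-2 style:
# LayerNorm is applied to the *input* of each sub-block, and the sub-block
# output is added back to the residual stream.
import torch
import torch.nn as nn

class PreLNBlock(nn.Module):
    def __init__(self, d_model: int = 768, n_head: int = 12):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_head, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor, attn_mask: torch.Tensor = None) -> torch.Tensor:
        h = self.ln1(x)                                   # normalize before attention
        a, _ = self.attn(h, h, h, attn_mask=attn_mask, need_weights=False)
        x = x + a                                         # residual connection
        x = x + self.mlp(self.ln2(x))                     # normalize before the MLP
        return x

# Usage: x has shape (batch, sequence, d_model)
block = PreLNBlock()
x = torch.randn(2, 16, 768)
y = block(x)
```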
Experiments
Model sizes

Parameters   Layers   d_model
117M         12       768      (≈ GPT-1)
345M         24       1024     (≈ BERT-large)
762M         36       1280
1542M        48       1600     (GPT-2)
Zero-shot results
Children’s Book Test
• Hill, Felix et al. “The Goldilocks Principle: Reading Children's Books with Explicit
Memory Representations.” (2016)

• Reports accuracy on an automatically constructed cloze test where the task is to predict which of 10 possible choices for an omitted word is correct.

• GPT-2 authors compute the probability of each choice and of the rest of the sentence conditioned on that choice according to the LM, and predict the choice with the highest probability (a toy scoring sketch follows below).
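A minimal sketch of this scoring rule, assuming a hypothetical helper lm_log_prob(text) that returns the model's total log-probability of a string (CBT marks the omitted word as XXXXX):

```python
# Toy sketch of the Children's Book Test scoring described above: fill the blank
# with each of the 10 candidates, score the completed sentence (given the story
# context) under the LM, and predict the highest-probability candidate.
# `lm_log_prob` is a hypothetical stand-in for the language model's scorer.
def predict_cloze(context: str, sentence_with_blank: str, choices: list[str],
                  lm_log_prob) -> str:
    def score(choice: str) -> float:
        filled = sentence_with_blank.replace("XXXXX", choice)  # CBT's blank marker
        return lm_log_prob(context + " " + filled)
    return max(choices, key=score)
```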
LAMBADA
• LAnguage Modeling Broadened to Account for Discourse Aspects

• Paperno, Denis et al. “The LAMBADA dataset: Word prediction requiring a broad
discourse context.” (2016)

• Task is to predict the final word of sentences which require at least 50 tokens of
context for a human to successfully predict

• GPT-2 improves perplexity on LAMBADA from 99.8 PPL to 8.63 PPL
Winograd Schema Challenge
• Measures commonsense reasoning by testing the model's ability to resolve ambiguities (e.g. pronoun references) in text
Winograd Schema Challenge
Trinh, Trieu H. and Quoc V. Le. “A Simple Method for Commonsense Reasoning.” (2018)
Summarization
• Added text “TL;DR:” after the
article and generated 100 tokens
with Top-k random sampling with
k=2

• CNN and Daily Mail dataset

• Used the first 3 sentences among these 100 generated tokens as the summary (the sampling setup is sketched below)
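A minimal sketch of this setup, assuming hypothetical stand-ins logits_fn (next-token logits for a token sequence) and encode/decode (the BPE tokenizer):

```python
# Toy sketch of the zero-shot summarization recipe described above: append
# "TL;DR:" to the article, sample 100 tokens with top-k random sampling (k=2),
# and keep the first 3 generated sentences as the summary.
import numpy as np

def top_k_sample(logits: np.ndarray, k: int = 2) -> int:
    """Sample a token id from the k highest-scoring next tokens."""
    top = np.argsort(logits)[-k:]
    probs = np.exp(logits[top] - logits[top].max())
    probs /= probs.sum()
    return int(np.random.choice(top, p=probs))

def summarize(article: str, logits_fn, encode, decode, n_tokens: int = 100) -> str:
    tokens = encode(article + "\nTL;DR:")
    prompt_len = len(tokens)
    for _ in range(n_tokens):
        tokens.append(top_k_sample(logits_fn(tokens), k=2))
    generated = decode(tokens[prompt_len:])
    return " ".join(generated.split(". ")[:3])   # first 3 generated sentences
```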
Translation
• ‘english sentence = french sentence’ format

• Generate text after ‘english sentence = ’

• Sample from the model with greedy decoding and use the first generated sentence as the translation

• GPT-2 gets 5 BLEU on WMT-14 English-French test set

• GPT-2 gets 11.5 BLEU on WMT-14 French-English test set

• Outperforms several unsupervised machine translation baselines (2017)

• But still much worse than 33.5 BLEU of the current SOTA of unsupervised machine translation (2019)
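A minimal sketch of this prompting scheme, again with hypothetical stand-ins next_token_fn (the id of the most likely next token), encode, and decode:

```python
# Toy sketch of the zero-shot translation prompt described above: condition on a
# few `english sentence = french sentence` example pairs, then on
# `<new english sentence> =`, decode greedily, and keep the first generated
# sentence as the translation.
def translate_en_fr(sentence: str, example_pairs, next_token_fn, encode, decode,
                    max_tokens: int = 64) -> str:
    prompt = "".join(f"{en} = {fr}\n" for en, fr in example_pairs)
    prompt += f"{sentence} ="
    tokens = encode(prompt)
    generated = []
    for _ in range(max_tokens):
        t = next_token_fn(tokens)        # greedy decoding: most likely next token
        tokens.append(t)
        generated.append(t)
    return decode(generated).split("\n")[0].strip()  # first generated sentence
```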
Translation
• Surprising result!

• Authors of GPT-2 deliberately removed non-English webpages from WebText as a
filtering step

• Authors ran a byte-level language detector on WebText

• Only 10MB of data in the French language

• (Approximately 500x smaller than the monolingual French corpus common in prior
unsupervised machine translation research)
Question Answering
• GPT-2 answers 4.1% of questions correctly when evaluated by the exact match metric commonly used on reading comprehension datasets like SQuAD (sketched below)

• The smallest model does not exceed the 1.0% accuracy of an incredibly simple baseline which returns the most common answer for each question type (who, what, where, etc.)

• → Suggests model capacity is important

• But GPT-2 has an accuracy of 63.1% on the 1% of questions it is most confident in
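For reference, a minimal sketch of the exact-match metric mentioned above, following the normalization used by the standard SQuAD evaluation (lower-case, strip punctuation and articles, collapse whitespace); this is an illustrative reimplementation, not the official script:

```python
# Minimal sketch of SQuAD-style exact match: normalize both strings, then
# compare for equality.
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, ground_truth: str) -> bool:
    return normalize(prediction) == normalize(ground_truth)
```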
Generalization vs Memorization
• It is important to analyze how much of the test data also shows up in the training data

• Using Bloom filters, the authors measured what percentage of each test dataset is also found in the WebText training set (sketched below)
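Below is a toy version of this analysis; the hand-rolled Bloom filter, the 8-gram granularity, and the simple lower-casing are simplifying assumptions for illustration rather than the paper's exact setup.

```python
# Toy sketch of the train/test overlap check described above: insert training-set
# 8-grams into a Bloom filter, then report the fraction of a test set's 8-grams
# that are (probably) also in the training data. False positives are possible,
# false negatives are not.
import hashlib

class BloomFilter:
    def __init__(self, n_bits: int = 1 << 24, n_hashes: int = 4):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item: str):
        for i in range(self.n_hashes):
            digest = hashlib.sha1(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def ngrams(text: str, n: int = 8):
    toks = text.lower().split()
    return {" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def overlap_percentage(train_docs, test_docs) -> float:
    bf = BloomFilter()
    for doc in train_docs:
        for gram in ngrams(doc):
            bf.add(gram)
    test_grams = set().union(*(ngrams(d) for d in test_docs))
    if not test_grams:
        return 0.0
    return 100.0 * sum(g in bf for g in test_grams) / len(test_grams)
```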
WebText Underfitting

• Both training and test performance on WebText keep improving with model size, i.e. even the largest GPT-2 still underfits WebText
Conclusions
• Unsupervised task learning is an additional promising area of research to explore 

• Performance of GPT-2 is competitive with supervised baselines in a zero-shot setting. 

• on reading comprehension

• but not on other tasks like summarization, etc…

• Studied zero-shot performance of WebText LMs on many canonical NLP tasks
Discussions
Personal Thoughts
• Rather than focusing on a novel model architecture, the paper focuses on unsupervised task learning, evaluating and analyzing on various canonical datasets and tasks

• Compared to the hype, the model's actual results are relatively modest

• Scaling is important. Modern research at large companies has already transitioned to huge models

• Zero-shot learning is interesting
What do you think about OpenAI not releasing the model?
(Is it ethical for OpenAI to keep the big model private?)
• Arguments against withholding:

• Propagates fear

• Reproducibility issues

• Creates unnecessary hype
• Argument for withholding: the model may be used maliciously, e.g. to

• Generate misleading news articles

• Automate the production of abusive or faked content to post on social media

• Automate the production of spam/phishing content
Thank you!