BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova
Google AI Language
2018.11.25
Presented by Young Seok Kim
https://arxiv.org/abs/1810.04805
!1
Articles & Useful Links
• Official

• ArXiv : https://arxiv.org/abs/1810.04805

• Blog : https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html

• GitHub : https://github.com/google-research/bert

• Unofficial

• Lyrn.ai blog : https://www.lyrn.ai/2018/11/07/explained-bert-state-of-the-art-language-model-for-nlp/

• Korean blog : https://rosinality.github.io/2018/10/bert-pre-training-of-deep-bidirectional-transformers-for-language-understanding
!2
Related Papers
• Vaswani, Ashish et al. “Attention Is All You Need.” NIPS (2017)

• PR-049 : https://youtu.be/6zGgVIlStXs

• Tutorial with code : http://nlp.seas.harvard.edu/2018/04/03/attention.html 

• Radford, Alec. “Improving Language Understanding by Generative Pre-Training.” (2018)

• Website : https://blog.openai.com/language-unsupervised/

• Paper : https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

• Wang, Alex et al. “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language
Understanding.” (2018)

• Website : https://gluebenchmark.com/

• Paper : https://arxiv.org/abs/1804.07461
!3
Preliminaries
!4
Attention Is All You Need
• Introduced the Transformer module

• Reduced computational complexity with respect to the sequence length
!5
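For reference, a minimal NumPy sketch of the scaled dot-product attention at the core of the Transformer, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V; the shapes and toy inputs below are illustrative only, not from the slides:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, as in Vaswani et al. (2017)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # pairwise token-to-token scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # weighted sum of value vectors

# Toy self-attention over 4 tokens with hidden size 8 (illustrative sizes only)
x = np.random.randn(4, 8)
print(scaled_dot_product_attention(x, x, x).shape)       # (4, 8)
```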
GLUE
• Benchmark introduced in Wang, Alex et al. “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding.” (2018)

• Contains 9 tasks (11 leaderboard entries, counting MNLI matched/mismatched and the diagnostic set)
!6
BERT

Bidirectional Encoder
Representations from Transformers
!7
Motivation
!8
[Figure: traditional RNN / LSTM / GRU units]
Motivation
!9
[Figure: commonly used bidirectional units]
Motivation
Problem
• Unfortunately, standard conditional language models can only be trained left-to-right or
right-to-left, since bidirectional conditioning would allow each word to indirectly “see
itself” in a multi-layered context.
!11
Problem
[Figure: Single Transformer Layer — input embeddings E1 … EN feed one row of Transformer blocks producing outputs T1 … TN]
!13
[Figure: Multi-layer Transformer — input embeddings E1 … EN pass through stacked rows of Transformer blocks producing outputs T1 … TN]
!14
[Figure: Multi-layer Transformer — same diagram as the previous slide]
Training Method
• Task #1 - Masked Language Model (MLM)
• Task #2 - Next Sentence Prediction (NSP)
Task #1 - Masked LM
!16
• Fill in the blank!

• Formally, Cloze Test

(https://en.wikipedia.org/wiki/Cloze_test)

• Similar to CBOW in Word2Vec?
Masked LM Procedure
• Choose 15% of the tokens at random (e.g. "hairy" in "My dog is hairy"); of those:

• 80% are replaced with [MASK]: "My dog is [MASK]"

• 10% are left unchanged: "My dog is hairy"

• 10% are replaced with a random word: "My dog is apple"
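A minimal sketch of this 15% / 80-10-10 masking rule (the token list and tiny vocabulary are made up for illustration; this is not the authors' data pipeline):

```python
import random

MASK = "[MASK]"
TOY_VOCAB = ["my", "dog", "is", "hairy", "apple", "the", "cat"]   # stand-in vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    """Pick ~15% of tokens; of those: 80% -> [MASK], 10% -> unchanged, 10% -> random word."""
    masked, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                          # the model must predict the original token
            r = random.random()
            if r < 0.8:
                masked[i] = MASK                     # "My dog is [MASK]"
            elif r < 0.9:
                pass                                 # keep unchanged: "My dog is hairy"
            else:
                masked[i] = random.choice(TOY_VOCAB) # random replacement: "My dog is apple"
    return masked, labels

print(mask_tokens(["my", "dog", "is", "hairy"]))
```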
Task #2 - Next Sentence Prediction (NSP)
• Binary classification over [IsNext, NotNext]

• The final pre-trained model achieves 97-98% accuracy on this task.
!19
Embedding
!20
!21
!22
• The first token of every sequence is always the special classification embedding [CLS]. The final hidden state corresponding to this token is used as the aggregate sequence representation for classification tasks. For non-classification tasks, this vector is ignored.
• Sentence pairs are packed together into a single sequence. The authors separate them in two ways:

1. Separate with the special token [SEP].

2. Add a learned sentence embedding to every token of the corresponding sentence.
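A minimal sketch of how a sentence pair is packed into one sequence with [CLS], [SEP], and per-sentence segment ids (whitespace tokenization is used here for brevity; BERT actually uses WordPiece, and the final input embedding is the sum of token, segment, and position embeddings):

```python
def pack_pair(sent_a, sent_b):
    """Build [CLS] A [SEP] B [SEP] and segment ids: 0 for sentence A, 1 for sentence B."""
    tokens_a, tokens_b = sent_a.split(), sent_b.split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids

tokens, segments = pack_pair("the man went to the store", "he bought a gallon of milk")
print(tokens)     # ['[CLS]', 'the', ..., '[SEP]', 'he', ..., '[SEP]']
print(segments)   # [0, 0, ..., 0, 1, ..., 1]
```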
Corpus
• BooksCorpus (800M words)

• English Wikipedia (2,500M words)

• Training dataset for next sentence prediction (see the sketch below)

• 50% - sentence B is the actual next sentence (IsNext)

• 50% - sentence B is a random sentence from the corpus (NotNext)
!23
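A minimal sketch of drawing NSP training pairs under the 50/50 rule above (the `docs` list of documents is a made-up stand-in, not the released data pipeline):

```python
import random

def make_nsp_example(docs):
    """Return (sentence A, sentence B, label): 50% actual next sentence, 50% random sentence."""
    doc = random.choice(docs)
    i = random.randrange(len(doc) - 1)
    sent_a = doc[i]
    if random.random() < 0.5:
        return sent_a, doc[i + 1], "IsNext"          # two adjacent sentences
    other = random.choice(docs)                      # in practice, drawn from elsewhere in the corpus
    return sent_a, random.choice(other), "NotNext"

docs = [
    ["the man went to the store", "he bought a gallon of milk"],
    ["penguins are flightless birds", "they live in the southern hemisphere"],
]
print(make_nsp_example(docs))
```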
Differences from OpenAI GPT
!24
• BERT
  • Corpus: BooksCorpus + English Wikipedia
  • [CLS] / [SEP] tokens: learned during pre-training
  • Steps: 1M steps with a batch size of 128,000 words
  • Learning rate: task-specific fine-tuning learning rate
• OpenAI GPT
  • Corpus: BooksCorpus only
  • [CLS] / [SEP] tokens: introduced only at fine-tuning time
  • Steps: 1M steps with a batch size of 32,000 words
  • Learning rate: same learning rate of 5e-5 for every fine-tuning task
Results
!25
Results - GLUE benchmark
!26
GLUE Benchmark
• MNLI: Multi-Genre Natural Language Inference 

• Given a pair of sentences, the goal is to predict whether the second sentence is an
entailment, contradiction, or neutral with respect to the first sentence.

• Two versions - MNLI matched, MNLI mismatched

• Two sentence, classification task
!27
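To connect this to the [CLS] vector described earlier: for a sentence-pair task like MNLI, the only new fine-tuning parameters are a classification layer over the K labels applied to the final [CLS] hidden state. A minimal NumPy sketch (the random vectors stand in for real BERT outputs):

```python
import numpy as np

H, K = 768, 3                                   # BERT-Base hidden size; 3 MNLI labels
LABELS = ["entailment", "contradiction", "neutral"]

rng = np.random.default_rng(0)
W = rng.normal(scale=0.02, size=(K, H))         # the only new task-specific parameters
b = np.zeros(K)

def classify(cls_vector):
    """Softmax over label scores computed from the final hidden state of [CLS]."""
    logits = W @ cls_vector + b
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return LABELS[int(np.argmax(probs))]

cls_vector = rng.normal(size=H)                 # stand-in for BERT's [CLS] output on a packed pair
print(classify(cls_vector))
```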
GLUE Benchmark
• QQP: Quora Question Pairs

• Quora Question Pairs is a binary classification task where the goal is to determine if
two questions asked on Quora are semantically equivalent 

• Two sentence, binary classification task
!28
GLUE Benchmark
• QNLI: Question Natural Language Inference 

• The positive examples are (question, sentence) pairs which do contain the correct
answer, and the negative examples are (question, sentence) from the same paragraph
which do not contain the answer. 

• Two sentence, binary classification task
!29
GLUE Benchmark
• SST-2: Stanford Sentiment Treebank 

• Binary single-sentence classification task consisting of sentences extracted from
movie reviews with human annotations of their sentiment 

• One sentence, binary classification task
!30
GLUE Benchmark
• CoLA: Corpus of Linguistic Acceptability 

• Binary single-sentence classification task, where the goal is to predict whether an
English sentence is linguistically “acceptable” or not 

• One sentence, binary classification task
!31
GLUE Benchmark
• STS-B: The Semantic Textual Similarity Benchmark

• A collection of sentence pairs drawn from news headlines and other sources, annotated with a score from 1 to 5 denoting how semantically similar the two sentences are

• Two sentence, regression (similarity scoring) task
!32
GLUE Benchmark
• MRPC: Microsoft Research Paraphrase Corpus 

• Consists of sentence pairs automatically extracted from online news sources, with
human annotations for whether the sentences in the pair are semantically equivalent 

• Two sentence, binary classification task
!33
GLUE Benchmark
• RTE: Recognizing Textual Entailment 

• A binary entailment task similar to MNLI, but with much less training data 

• Two sentence, binary classification task
!34
GLUE Benchmark
• WNLI: Winograd Natural Language Inference

• A small natural language inference dataset derived from the Winograd Schema Challenge

• The GLUE webpage notes that there are issues with the construction of this dataset 

• Authors therefore exclude this set
!35
!36
SQuAD v1.1
• Stanford Question Answering Dataset (SQuAD) is a collection of 100k crowdsourced question/answer pairs

• Given a question and a paragraph from Wikipedia containing the answer, the task is to
predict the answer text span in the paragraph
!38
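A minimal sketch of the span prediction used for SQuAD: fine-tuning adds only a start vector S and an end vector E, each token's final hidden state T_i is scored against them, and the highest-scoring span with j >= i is predicted (random arrays stand in for real BERT outputs):

```python
import numpy as np

def best_span(token_states, S, E, max_answer_len=30):
    """Score span (i, j) as S.T_i + E.T_j and return the best (i, j) with j >= i."""
    start_scores = token_states @ S              # one start score per paragraph token
    end_scores = token_states @ E                # one end score per paragraph token
    best, best_score = (0, 0), -np.inf
    for i in range(len(token_states)):
        for j in range(i, min(i + max_answer_len, len(token_states))):
            score = start_scores[i] + end_scores[j]
            if score > best_score:
                best, best_score = (i, j), score
    return best                                  # token indices of the predicted answer span

rng = np.random.default_rng(0)
T = rng.normal(size=(16, 768))                   # stand-in for final hidden states of 16 tokens
S, E = rng.normal(size=768), rng.normal(size=768)
print(best_span(T, S, E))
```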
Results on SQuAD v1.1
!39
SWAG
• Situations With Adversarial Generations Dataset

• Given a sentence from a video captioning dataset, the task is to choose the most plausible continuation among four candidates.
!40
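A minimal sketch of SWAG scoring: each of the four (sentence, candidate continuation) pairs is packed as sentence A / sentence B, and the only task-specific parameter is a vector V whose dot product with each sequence's [CLS] representation gives that choice's score (random arrays stand in for real BERT outputs):

```python
import numpy as np

H = 768                                          # BERT-Base hidden size
rng = np.random.default_rng(0)
V = rng.normal(scale=0.02, size=H)               # the only task-specific parameter vector

def pick_continuation(cls_vectors):
    """cls_vectors: final [CLS] states for the 4 packed (sentence, choice) sequences."""
    scores = cls_vectors @ V                     # one score per candidate continuation
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                         # softmax over the four choices
    return int(np.argmax(probs))

cls_vectors = rng.normal(size=(4, H))            # stand-in for BERT outputs on the 4 sequences
print(pick_continuation(cls_vectors))
```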
SWAG Results
GLUE Results
Ablation Study
!43
Model size
!44
Conclusion
• Unsupervised pre-training is now an integral part of many language understanding
systems.

• Models can now be trained with truly deep bidirectional architectures.

• State-of-the-art on almost every NLP task evaluated, in some cases surpassing human performance.
!45
Personal thoughts
• The paper is well written and easy to follow

• SOTA on not just one task/dataset but on almost all tasks

• I think this method is going to be used universally as a baseline for future NLP research

• A more objective comparison between BERT and OpenAI GPT is possible because the baseline's parameters were chosen to make it almost identical in size to OpenAI GPT

• The model looks very simple yet is flexible enough to adapt to various tasks with simple modifications to the top layer

• Unsupervised pre-training followed by supervised fine-tuning might prevail in many domains.
!46
Thank you!
!47
References
• Images are either from

• the original papers, or

• https://towardsdatascience.com/introduction-to-sequence-models-rnn-bidirectional-rnn-lstm-gru-73927ec9df15

• https://colah.github.io/posts/2015-08-Understanding-LSTMs/

• https://www.lyrn.ai/2018/11/07/explained-bert-state-of-the-art-language-model-for-nlp/
!48
