Thomas Delteil – Machine Learning Scientist @ AWS AI
tdelteil@amazon.com
8th March 2018
Recent advances in
Natural Language Processing
Objective
- NLP domain overview
- Traditional methods
- Word Embeddings (word2vec)
- Contextualized word embeddings (ELMo)
- Bidirectional Encoder Representations from Transformers (BERT)
- Generative Pre-Training 2 (GPT-2)
What is covered in NLP
Text classification
Language Modelling
P(w_t | w_{t−1}, w_{t−2}, …):  "See you later […]" → alligator / today
P(w_t | w_{t+1}, w_{t+2}, …):  "[…] abhors a vacuum" → Nature / Fido
David Gascoyne
Automatic Text Generation
http://botpoet.com
The crow crooked on more beautiful and free,
He journeyed off into the quarter sea.
His radiant ribs girdled empty and very
least beautiful as dignified to see.
The smooth plain with its mirrors listens to the cliff
Like a basilisk eating flowers.
And the children, lost in the shadows of the catacombs,
Call to the mirrors for help:
“Strong-bow of salt, cutlass of memory,
Write on my map the name of every river.”
Natural Language Understanding
"Alexa, remind me to buy groceries after work"
Intent detection: Create Reminder
Slot filling: What / When / Where
Machine Translation
Sometimes, in the morning, I wonder whether AI bots will kill us all
時々、午前中に、AIボットが私たち全員を殺すのだろうか?
Text Summarization
A Neural Attention Model for Abstractive Sentence Summarization, Alexander M. Rush et al. 2015
Question Answering:
“Who was president when Barack Obama was born?”
John Fitzgerald Kennedy
Part of speech tagging
Sentence similarity
Commonsense Reasoning
Coreference Resolution
…
Classical Methods
Text representation:
Lexicon-based → quickly explodes with N >> 10,000
→ Text pre-processing
Text Pre-Processing
I’d love to drive again in the mountainous roads of Crete.
Normalization:       I would love to drive again in the mountainous roads of crete.
Tokenization:        I · would · love · to · drive · again · in · the · mountainous · roads · of · crete · .
Stop words removal:  would · love · drive · again · mountainous · roads · crete · .
Lemmatization:       would · love · drive · again · mountain · road · crete · .
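A minimal Python sketch of this pipeline, using a toy stop-word list, contraction table and lemma table (all hypothetical, not the speaker's code), just to make the four steps concrete:

```python
import re

# Toy illustration of the preprocessing pipeline above.
STOP_WORDS = {"i", "to", "in", "the", "of", "a", "an"}
LEMMAS = {"mountainous": "mountain", "roads": "road"}          # tiny lemma table
CONTRACTIONS = {"i'd": "i would"}

def preprocess(text):
    text = text.lower()                                        # normalization
    for contraction, full in CONTRACTIONS.items():
        text = text.replace(contraction, full)
    tokens = re.findall(r"[a-z]+|[.,!?]", text)                # tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]        # stop-word removal
    tokens = [LEMMAS.get(t, t) for t in tokens]                # lemmatization
    return tokens

print(preprocess("I'd love to drive again in the mountainous roads of Crete."))
# ['would', 'love', 'drive', 'again', 'mountain', 'road', 'crete', '.']
```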
Grapheme/Token representation: One-Hot encoding
Define each word as a vector
"I’d love to drive …" → preprocessing → would · love · drive
Dictionary: [drive, love, would]
drive → [1, 0, 0]
love  → [0, 1, 0]
would → [0, 0, 1]
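A quick numpy sketch of one-hot encoding over this toy dictionary (illustrative only):

```python
import numpy as np

# One-hot encoding over a toy vocabulary.
vocab = {"drive": 0, "love": 1, "would": 2}

def one_hot(word, vocab):
    v = np.zeros(len(vocab))
    v[vocab[word]] = 1.0       # a single 1 at the word's index
    return v

print(one_hot("love", vocab))  # [0. 1. 0.]
```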
Sentence representation: Bag of words
Sum of the one-hot encoded word vectors
"I’d love to drive …" → would · love · drive → [1, 1, 1]   (dictionary size = 3)
If the dictionary size >>> 1, the vector is mostly zeros: very sparse!
e.g. [0, 0, …, 1, 0, …, 3, 0, …, 1, 0, …] for a realistic vocabulary
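A minimal bag-of-words sketch on the same toy vocabulary (a real dictionary would have tens of thousands of entries, hence the sparsity):

```python
import numpy as np
from collections import Counter

# Bag of words = sum of one-hot vectors = per-word counts.
vocab = {"drive": 0, "love": 1, "would": 2}

def bag_of_words(tokens, vocab):
    v = np.zeros(len(vocab))
    for token, count in Counter(tokens).items():
        if token in vocab:                 # out-of-vocabulary tokens are dropped
            v[vocab[token]] = count
    return v

print(bag_of_words(["would", "love", "drive"], vocab))  # [1. 1. 1.]
```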
TF*IDF
Term frequency * inverse document frequency
TF = (number of times the term appears in the doc) / (total number of terms in the doc)
IDF = ln( (number of documents) / (number of documents the term appears in) )
Example TF*IDF vector: mostly zeros, with real-valued weights for the terms present in the document (e.g. 2.3, 0.1, 8, 1.2, 0.5).
Classifiers
SVM
MLP
Naïve Bayes
XGBoost
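A small sketch implementing the TF and IDF formulas above on a made-up three-document corpus; the resulting vectors are what would be fed to one of the classifiers listed (SVM, MLP, Naïve Bayes, XGBoost):

```python
import math

# TF-IDF following the formulas above (toy corpus, illustrative only).
docs = [["would", "love", "drive", "drive"],
        ["love", "mountain", "road"],
        ["drive", "road", "crete"]]

def tf(term, doc):
    return doc.count(term) / len(doc)

def idf(term, docs):
    n_containing = sum(term in d for d in docs)
    return math.log(len(docs) / n_containing)

def tfidf_vector(doc, docs, vocab):
    return [tf(t, doc) * idf(t, docs) for t in vocab]

vocab = sorted({t for d in docs for t in d})
X = [tfidf_vector(d, docs, vocab) for d in docs]
# X can now be passed to a classifier such as an SVM or XGBoost.
print(vocab)
print(X[0])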
Limitations: no semantic information
With one-hot encoding, every pair of distinct words is equally far apart:
|| v_automobile − v_car ||₂ = || v_automobile − v_mountain ||₂ = √2
Ideally we would want:
|| v_automobile − v_car ||₂ ≈ 0
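A tiny numpy check of the claim, using a hypothetical three-word vocabulary:

```python
import numpy as np

# All distinct one-hot vectors are exactly sqrt(2) apart.
automobile = np.array([1.0, 0.0, 0.0])
car        = np.array([0.0, 1.0, 0.0])
mountain   = np.array([0.0, 0.0, 1.0])

print(np.linalg.norm(automobile - car))       # 1.4142... = sqrt(2)
print(np.linalg.norm(automobile - mountain))  # 1.4142... = sqrt(2)
```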
Word order matters
• Context dependent information
• The place of the word in the sentence matters
My kindle is easy to use,
I do not need help
I do need help, my kindle
is not easy to use
→ We need a better grapheme/token representation and better context understanding
Word2vec: Efficient Estimation of Word Representations in Vector Space
Mikolov et al. 2013
Learn word embeddings:
Skip-gram: predict context given center word
Continuous Bag of Words (CBOW): predict center word given context
CBOW model
… The cake is a lie …
Context words at t−2, t−1 and t+1, t+2; word to predict at t.
Estimate: P(w_t | w_{t−2}, w_{t−1}, w_{t+1}, w_{t+2})
Learning process: ℒ = −log P(w_t | w_{t−2}, w_{t−1}, w_{t+1}, w_{t+2})
source: https://lilianweng.github.io/lil-log/2017/10/15/learning-word-embedding.html
source: https://opensource.googleblog.com/2013/08/learning-meaning-behind-words.html
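A minimal CBOW sketch in PyTorch (illustrative; the original word2vec implementation uses hierarchical softmax or negative sampling instead of a full softmax, and all sizes below are made up):

```python
import torch
import torch.nn as nn

class CBOW(nn.Module):
    def __init__(self, vocab_size, embed_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # the word vectors we want to learn
        self.out = nn.Linear(embed_dim, vocab_size)

    def forward(self, context_ids):                   # context_ids: (batch, 2*window)
        ctx = self.embed(context_ids).mean(dim=1)     # average the context embeddings
        return self.out(ctx)                          # scores over the vocabulary

model = CBOW(vocab_size=10000, embed_dim=100)
context = torch.randint(0, 10000, (8, 4))             # e.g. "The cake [?] a lie" windows
center = torch.randint(0, 10000, (8,))
loss = nn.CrossEntropyLoss()(model(context), center)  # -log P(w_t | context)
loss.backward()
```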
Using Word Representations in Neural Networks
"Amazon is amazing" → indexing → 2910, 79, 1927 → lookup → W2910, W79, W1927
The embedding matrix is a |V| × N table with one row Wi per vocabulary word: W1, W2, …, Wi, …, W|V|.
{Wi} are the word embeddings: parameters that the neural network can modify during training, and that can be pre-trained.
The looked-up vectors then feed the neural layers and the output layer.
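A short PyTorch sketch of the indexing + lookup step (the indices 2910/79/1927 are the made-up ones from the slide):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 50000, 300
embedding = nn.Embedding(vocab_size, embed_dim)   # the |V| x N matrix {Wi}

token_ids = torch.tensor([[2910, 79, 1927]])      # "Amazon is amazing" -> indices
vectors = embedding(token_ids)                    # shape: (1, 3, 300)
print(vectors.shape)

# Optionally initialise from pre-trained vectors (e.g. word2vec / GloVe),
# assuming `pretrained_matrix` is a |V| x N numpy array:
# embedding.weight.data.copy_(torch.from_numpy(pretrained_matrix))
```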
Recurrent Neural Networks
Recurrent Neural Network: Language Modelling
<BOS> Amazon is amazing → indices 1, 2910, 79, 1927 → embeddings W1, W2910, W79, W1927 (size N each)
Starting from h_init, the RNN consumes one embedding per time step and updates its hidden state: h0, h1, h2, h3.
Each hidden state is projected ("Proj") to a distribution over the vocabulary: P(w | h0), P(w | h1), P(w | h2), P(w | h3).
loss = −log P(w=Amazon | h0) − log P(w=is | h1) − log P(w=amazing | h2) − log P(w=<EOS> | h3)
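A minimal RNN language-model sketch in PyTorch (an LSTM stands in for the generic RNN cell; vocabulary size and the <EOS> index are illustrative):

```python
import torch
import torch.nn as nn

class RNNLM(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)      # "Proj" in the diagram

    def forward(self, token_ids):                 # (batch, T)
        h, _ = self.rnn(self.embed(token_ids))    # h: (batch, T, hidden_dim)
        return self.proj(h)                       # logits over V at every step

model = RNNLM(vocab_size=50000)
inputs  = torch.tensor([[1, 2910, 79, 1927]])     # <BOS> Amazon is amazing
targets = torch.tensor([[2910, 79, 1927, 2]])     # Amazon is amazing <EOS> (2 = <EOS> index on the slides)
logits = model(inputs)
loss = nn.CrossEntropyLoss()(logits.view(-1, 50000), targets.view(-1))
```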
Convolutional Neural Network for Text Classification
Source: Character-level Convolutional Networks for Text Classification, Zhang et al. 15
[Figure: 1-D convolutions and pooling applied along the time axis of the embeddings × time matrix]
Convolutional Neural Network
<BOS> Amazon is amazing <EOS> → indices 1, 2910, 79, 1927, 2 → embeddings W1, W2910, W79, W1927, W2 (an N × T matrix)
Convolutional filters slide along the time axis, each producing a feature map Ci,t (one row per filter, one column per position); stacking convolutional layers aggregates progressively wider spans of the input.
Receptive field allows long range dependencies.
The final features x0 … xn are multiplied by class weights Wpos, Wneut, Wneg and passed through a softmax:
Pos 92%, Neutral 8%, Neg 0%
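A compact text-CNN classifier sketch in PyTorch (one convolutional layer and max-pooling over time; filter counts, kernel size and the three sentiment classes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=300, num_filters=100, kernel_size=3, num_classes=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # 1-D convolution along the time axis of the N x T embedding matrix
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size, padding=1)
        self.fc = nn.Linear(num_filters, num_classes)      # Wpos / Wneut / Wneg

    def forward(self, token_ids):                          # (batch, T)
        x = self.embed(token_ids).transpose(1, 2)          # (batch, N, T)
        c = F.relu(self.conv(x))                           # feature maps C_{i,t}
        pooled = c.max(dim=2).values                       # max-pool over time
        return F.softmax(self.fc(pooled), dim=-1)          # e.g. [0.92, 0.08, 0.00]

model = TextCNN(vocab_size=50000)
probs = model(torch.tensor([[1, 2910, 79, 1927, 2]]))      # <BOS> Amazon is amazing <EOS>
print(probs)
```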
Limitations
• Rare words are not well represented or just <UNK>
Half-way solutions:
• fastText: sum of subword embeddings
• Character ngrams
• Byte Pair Encoding (BPE)
Limitations
Polysemy: meaning of a word
• Java
• Python
Depends on the context
• I love travelling. I am going to explore Java.
https://en.wikipedia.org/wiki/Java
Limitations
Context can be bidirectional:
I went to the bank, to drop off some money
ELMo Embeddings (Peters et al. 18)
Contextualized word embeddings
Pre-training on bidirectional language modelling:
[Figure: a character-CNN embedding layer (Θe) over x_bos, x1 … xn, x_eos feeds a forward and a backward multi-layer LSTM (Θ_LSTM); a shared softmax layer (Θs) predicts y1 … yn, i.e. the next / previous word]
ELMo Embeddings (Peters et al. 18)
Contextualized word embeddings
Fine-Tuning:
For each token, a learnt linear combination of the biLM hidden states (on top of the char-CNN embedding) gives a representation R1 … Rn, which is fed to a task-specific neural network.
ELMo Embeddings (Peters et al. 18)
Pre-training on the language model: input sentence → ELMo → softmax layer → output (probabilities over V)
Training on your task: input sentence → ELMo → your task NN → output
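A sketch of the "learnt linear combination of hidden states" (often called a scalar mix), assuming the biLM layer outputs have already been computed; layer count and dimensions are illustrative, not ELMo's exact configuration:

```python
import torch
import torch.nn as nn

class ScalarMix(nn.Module):
    def __init__(self, num_layers):
        super().__init__()
        self.weights = nn.Parameter(torch.zeros(num_layers))  # mixing weights s_j (before softmax)
        self.gamma = nn.Parameter(torch.ones(1))               # task-specific scale

    def forward(self, layer_states):           # list of (batch, T, dim) tensors
        s = torch.softmax(self.weights, dim=0)
        return self.gamma * sum(w * h for w, h in zip(s, layer_states))

# Example: 3 biLM layers (char-CNN output + 2 LSTM layers), batch of 2, 7 tokens, dim 1024
layers = [torch.randn(2, 7, 1024) for _ in range(3)]
elmo_embeddings = ScalarMix(num_layers=3)(layers)   # fed to the task-specific network
print(elmo_embeddings.shape)                        # torch.Size([2, 7, 1024])
```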
BERT (Devlin et al. 18)
Bidirectional Encoder Representations from Transformers (BERT)
BERT (Devlin et al. 18)
Inspired by “Improving Language Understanding by Generative Pre-Training”, Radford et al. 2018 (GPT-1, OpenAI)
Based on the Transformer and its multi-head self-attention mechanism
Source: https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
BERT (Devlin et al. 18)
Self-attention: “The apple is red, it is delicious”
[Figure: attention links among the tokens "The apple is red , it is delicious"; each token attends to every other token, e.g. "it" can attend to "apple"]
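A minimal single-head scaled dot-product self-attention sketch (random weights, dimensions chosen only for illustration):

```python
import torch
import torch.nn.functional as F

def self_attention(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv                       # queries, keys, values
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5  # scaled dot products
    weights = F.softmax(scores, dim=-1)                    # how much each token attends to the others
    return weights @ v, weights

T, d = 8, 64                                  # 8 tokens: "The apple is red , it is delicious"
x = torch.randn(T, d)                         # token embeddings
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
out, attn = self_attention(x, Wq, Wk, Wv)
print(attn[5])                                # attention distribution of token "it" over all 8 tokens
```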
BERT (Devlin et al. 18)
BERT input representation: the sum of
- WordPiece embeddings
- Sentence (segment) embeddings
- Position embeddings
all learned during the (pre)training process.
In pre-training, 15% of the input tokens are replaced by the [MASK] token (embedding E_MASK) for the masked LM task.
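A toy sketch of the masked-LM corruption step (simplified: the actual BERT recipe sometimes keeps the selected token or replaces it with a random one instead of always using [MASK]):

```python
import random

def mask_tokens(tokens, mask_prob=0.15, mask_token="[MASK]"):
    masked, targets = [], []
    for t in tokens:
        if random.random() < mask_prob:
            masked.append(mask_token)
            targets.append(t)          # the model must predict the original token
        else:
            masked.append(t)
            targets.append(None)       # no prediction needed at this position
    return masked, targets

print(mask_tokens("the apple is red , it is delicious".split()))
```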
Training objectives in slightly modified BERT models for downstream tasks. (Image source: original paper)
Fine-tuning
For classification tasks: take the final hidden state of the [CLS] token, h_L^[CLS], and a small weight matrix W_CLS:
softmax(h_L^[CLS] · W_CLS)
BERT (Devlin et al. 18)
- No need for a custom neural network architecture for fine-tuning
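A sketch of that classification head, softmax(h_L^[CLS] · W_CLS), where `encoder` is a stand-in for a pre-trained BERT encoder (hypothetical here, not a real library call):

```python
import torch
import torch.nn as nn

class BertClassifier(nn.Module):
    def __init__(self, encoder, hidden_dim=768, num_classes=2):
        super().__init__()
        self.encoder = encoder
        self.W_cls = nn.Linear(hidden_dim, num_classes)   # the small weight matrix W

    def forward(self, token_ids):
        h = self.encoder(token_ids)       # (batch, T, hidden_dim), last-layer hidden states
        h_cls = h[:, 0]                   # hidden state of the leading [CLS] token
        return torch.softmax(self.W_cls(h_cls), dim=-1)

# Example with a dummy encoder; a real setup would plug in pre-trained BERT weights.
dummy_encoder = lambda ids: torch.randn(ids.shape[0], ids.shape[1], 768)
clf = BertClassifier(dummy_encoder)
print(clf(torch.randint(0, 30000, (2, 16))))
```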
BERT (Devlin et al. 18)
XLNet (Yang et al. 19)
XLNet: Generalized Autoregressive Pretraining for Language
Understanding
Problems with BERT:
1. The [MASK] token used in pre-training does not appear during fine-tuning
2. BERT predicts the masked tokens independently of one another
I went to [MASK] [MASK] and saw the [MASK] [MASK] [MASK].
XLNet (Yang et al. 19)
XLNet: Generalized Autoregressive Pretraining for Language
Understanding
Bidirectional context through autoregressive prediction of the tokens in randomly permuted orders
OpenAI GPT-2 (Radford et al. 19)
Language Models are Unsupervised Multitask Learners
Source: original paper
Trained on the language modelling task
40GB text corpus
1.5B parameters
OpenAI GPT-2 (Radford et al. 19)
Language Models are Unsupervised Multitask Learners
All the downstream language tasks are framed as predicting conditional
probabilities and there is no task-specific fine-tuning.
Zero-shot learning:
Summarization:
P(w | "text to summarize" + " TL;DR: <?>")
Question Answering:
P(w | "text" + "Q: … A: … Q: … A: <?>")
Machine Translation:
P(w | "I like computers = J'aime les ordinateurs; I live in Vancouver = <?>")
Source: https://blog.openai.com/better-language-models/
Source: original paper
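A sketch of this zero-shot prompting, assuming the Hugging Face transformers package and its released "gpt2" weights (not part of the original talk); the task is specified entirely in the prompt, as in the TL;DR formulation above:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

article = "..."                                   # text to summarize (placeholder)
prompt = article + "\nTL;DR:"                     # the task lives in the prompt, no fine-tuning
input_ids = tokenizer.encode(prompt, return_tensors="pt")
output = model.generate(input_ids,
                        max_length=input_ids.shape[1] + 50,
                        do_sample=True, top_k=40)
print(tokenizer.decode(output[0][input_ids.shape[1]:]))   # the generated continuation
```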
Conclusion
- Count-based word representation (tf-idf)
- Learnt word representation (word2vec)
- Contextualized embeddings + custom network (ELMo)
- Sentence embeddings + fine-tuning (BERT, XLNet)
- Zero-shot transfer with a large language model (GPT-2)
Language representation
Task specific adaptation
References
Word embeddings:
LSA - Indexing by latent semantic analysis, Dumais et al. 1990
Word2Vec - Efficient Estimation of Word Representations in Vector Space, Mikolov et al. 2013
GloVe - GloVe: Global Vectors for Word Representation. Pennington et al. 2014
Subword embeddings
CNN character embedding layer - Character-Aware Neural Language Models, Kim et al. 2015
FastText - Enriching Word Vectors with Subword Information, Bojanowski et al. 2017
WordPiece - Google’s NMT System: Bridging the Gap between Human and Machine Translation, Wu et al. 2016
Contextualized embeddings
ELMo - Deep contextualized word representations, Peters et al. 2018
CoVe - Learned in Translation: Contextualized Word Vectors, McCann et al. 2017
Pre-trained deep learning architecture
Transformer - Attention Is All You Need, Vaswani et al. 2017
OpenAI GPT - Improving language understanding with unsupervised learning, Radford et al. 2018
BERT - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al. 2018
OpenAI GPT-2 - Language Models are Unsupervised Multitask Learners, Radford et al. 2019
8th March 2018
Thank you!
tdelteil@amazon.com
github.com/ThomasDelteil
twitter.com/thdelteil
8th March 2018
GluonNLP Toolkit

Editor's Notes

  • Word2vec training: use a softmax cross-entropy loss; minimize −log of the probability of the context word; an unsupervised learning process over large corpora.
  • BERT pre-training: Task 1: masked language model (MLM); Task 2: next sentence prediction. Note that the first token is always forced to be [CLS], a placeholder that will be used later for prediction in downstream tasks.