Distributed Representations for
Natural Language Processing
Tomas Mikolov, Facebook
ML Prague 2016
Structure of this talk
• Motivation
• Word2vec
• Architecture
• Evaluation
• Examples
• Discussion
Motivation
Representation of text is very important for the performance of many real-world
applications: search, ads recommendation, ranking, spam filtering, …
• Local representations
• N-grams
• 1-of-N coding
• Bag-of-words
• Continuous representations
• Latent Semantic Analysis
• Latent Dirichlet Allocation
• Distributed Representations
Motivation: example
Suppose you want to quickly build a classifier:
• Input = keyword, or user query
• Output = is user interested in X? (where X can be a service, ad, …)
• Toy classifier: is X capital city?
• Getting training examples can be difficult, costly, and time consuming
• With local representations of input (1-of-N), one will need many
training examples for decent performance
Motivation: example
Suppose we have a few training examples:
• (Rome, 1)
• (Turkey, 0)
• (Prague, 1)
• (Australia, 0)
• …
Can we build a good classifier without much effort?
Motivation: example
Suppose we have a few training examples:
• (Rome, 1)
• (Turkey, 0)
• (Prague, 1)
• (Australia, 0)
• …
Can we build a good classifier without much effort?
YES, if we use good pre-trained features.
Motivation: example
Pre-trained features: leveraging the vast amounts of unannotated text data
• Local features:
• Prague = (0, 1, 0, 0, ..)
• Tokyo = (0, 0, 1, 0, ..)
• Italy = (1, 0, 0, 0, ..)
• Distributed features:
• Prague = (0.2, 0.4, 0.1, ..)
• Tokyo = (0.2, 0.4, 0.3, ..)
• Italy = (0.5, 0.8, 0.2, ..)
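To make the contrast concrete, here is a minimal sketch (not from the talk) of a nearest-centroid classifier built on distributed features. The 3-dimensional vectors and the is_capital helper are made-up toy values standing in for real pre-trained embeddings:

```python
# Toy illustration: why distributed features generalize from few examples.
# The 3-d vectors below are invented numbers, not real embeddings.
import numpy as np

features = {
    "Rome":      np.array([0.9, 0.1, 0.2]),
    "Prague":    np.array([0.8, 0.2, 0.1]),
    "Turkey":    np.array([0.1, 0.9, 0.3]),
    "Australia": np.array([0.2, 0.8, 0.4]),
    "Tokyo":     np.array([0.85, 0.15, 0.2]),   # never seen during training
}
train = [("Rome", 1), ("Turkey", 0), ("Prague", 1), ("Australia", 0)]

# Nearest-centroid classifier: average the feature vectors of each class.
pos = np.mean([features[w] for w, y in train if y == 1], axis=0)
neg = np.mean([features[w] for w, y in train if y == 0], axis=0)

def is_capital(word):
    v = features[word]
    return int(np.linalg.norm(v - pos) < np.linalg.norm(v - neg))

print(is_capital("Tokyo"))   # -> 1, because Tokyo lies close to Rome and Prague
# With 1-of-N inputs, every unseen word is equally far from both centroids,
# so no such generalization is possible.
```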
Distributed representations
• We hope to learn such representations so that Prague, Rome, Berlin,
Paris etc. will be close to each other
• We do not want just to cluster words: we seek representations that
can capture multiple degrees of similarity: Prague is similar to Berlin
in some way, and to Czech Republic in another way
• Can this even be done without manually created databases like
WordNet / knowledge graphs?
Word2vec
• Simple neural nets can be used to obtain distributed representations
of words (Hinton et al, 1986; Elman, 1991; …)
• The resulting representations have interesting structure – vectors can
be obtained using a shallow network (Mikolov, 2007)
Word2vec
• Deep learning for NLP (Collobert & Weston, 2008): let’s use deep
neural networks! It works great!
• Back to shallow nets: Word2vec toolkit (Mikolov et al, 2013) -> much
more efficient than deep networks for this task
Word2vec
Two basic architectures:
• Skip-gram
• CBOW
Two training objectives:
• Hierarchical softmax
• Negative sampling
Plus a bunch of tricks: weighting of distant words and down-sampling of frequent
words (sketched below)
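As an illustration of the last trick, here is a minimal sketch of down-sampling of frequent words, following the discard probability P(w) = 1 - sqrt(t / f(w)) from Mikolov et al. (2013). The toy corpus and the threshold value are assumptions chosen for the example:

```python
# Down-sampling of frequent words: each occurrence of word w is discarded with
# probability 1 - sqrt(t / f(w)), where f(w) is its relative frequency.
import random
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()
t = 0.05                      # toy threshold; the paper suggests ~1e-5 for large corpora
counts = Counter(corpus)
total = len(corpus)

def keep(word):
    f = counts[word] / total                      # relative frequency
    p_discard = max(0.0, 1.0 - (t / f) ** 0.5)    # frequent words -> higher p_discard
    return random.random() > p_discard

subsampled = [w for w in corpus if keep(w)]
print(subsampled)             # very frequent words like "the" are dropped most often
```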
Skip-gram Architecture
• Predicts the surrounding words given the current word
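A rough sketch of how (input word, word to predict) pairs can be generated for this objective. The per-position random shrinking of the window is one common way to implement the weighting of distant words; this is my reading of the approach, not code from the toolkit:

```python
# Generate skip-gram training pairs: the current word predicts each word
# within a randomly shrunk window around it.
import random

def skipgram_pairs(tokens, max_window=2):
    pairs = []
    for i, center in enumerate(tokens):
        b = random.randint(1, max_window)            # effective window for this position
        for j in range(max(0, i - b), min(len(tokens), i + b + 1)):
            if j != i:
                pairs.append((center, tokens[j]))    # (input word, word to predict)
    return pairs

print(skipgram_pairs("the quick brown fox jumps".split()))
```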
Continuous Bag-of-words Architecture
• Predicts the current word given the context
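A minimal forward-pass sketch of this idea, with a toy vocabulary and randomly initialized weight matrices standing in for trained ones:

```python
# CBOW forward pass: average the input vectors of the context words, then
# score every vocabulary word for the center position with a softmax.
import numpy as np

vocab = ["the", "quick", "brown", "fox", "jumps"]
idx = {w: i for i, w in enumerate(vocab)}
dim = 8
rng = np.random.default_rng(0)
W_in = rng.normal(size=(len(vocab), dim))     # input->hidden weights (word vectors)
W_out = rng.normal(size=(dim, len(vocab)))    # hidden->output weights

def cbow_predict(context_words):
    h = np.mean([W_in[idx[w]] for w in context_words], axis=0)   # average context
    scores = h @ W_out
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()                # distribution over the center word

print(cbow_predict(["the", "brown"]))         # training would push mass onto "quick"
```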
Word2vec: Linguistic Regularities
• After training is finished, the weight matrix between the input and hidden layers
represents the word feature vectors
• The word vector space implicitly encodes many regularities among words.
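In code, reading out a word vector is just a row lookup in that matrix, and similar words are found with cosine similarity. The matrix below is a random stand-in for a trained one:

```python
# Word vector of w = row w of the input-to-hidden weight matrix;
# nearest neighbours by cosine similarity.
import numpy as np

vocab = ["prague", "berlin", "paris", "dog", "cat"]
idx = {w: i for i, w in enumerate(vocab)}
W_in = np.random.default_rng(1).normal(size=(len(vocab), 8))   # stand-in for trained vectors

def most_similar(word, topn=3):
    W = W_in / np.linalg.norm(W_in, axis=1, keepdims=True)      # unit-length rows
    sims = W @ W[idx[word]]                                     # cosine similarities
    best = [vocab[i] for i in np.argsort(-sims) if vocab[i] != word]
    return best[:topn]

print(most_similar("prague"))   # with real trained vectors: other city names
```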
Linguistic Regularities in Word Vector Space
• The resulting distributed representations of words contain
a surprising amount of syntactic and semantic information
• There are multiple degrees of similarity among words:
• KING is similar to QUEEN as MAN is similar to WOMAN
• KING is similar to KINGS as MAN is similar to MEN
• Simple vector operations with the word vectors provide very intuitive
results (King – man + woman ~= Queen)
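A sketch of that vector arithmetic, assuming vectors is a word-to-vector dictionary taken from a trained model; the analogy helper and the cosine search are my own illustration of the idea:

```python
# Answer "a is to b as c is to ?" by finding the word closest to b - a + c.
import numpy as np

def analogy(a, b, c, vectors):
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target)
    best, best_sim = None, -1.0
    for w, v in vectors.items():
        if w in (a, b, c):                    # exclude the question words
            continue
        sim = float(v @ target / np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# usage: analogy("man", "king", "woman", vectors)  ->  ideally "queen"
```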
Linguistic Regularities - Evaluation
• Regularity of the learned word vector space was evaluated using a test
set with about 20K analogy questions (see the accuracy sketch below)
• The test set contains both syntactic and semantic questions
• Comparison to the previous state of the art (pre-2013)
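Accuracy on such a test set can be computed along these lines. This is a sketch reusing the analogy helper from the previous example; the question tuples and the vectors dictionary would come from the test file and a trained model:

```python
# Each question is a 4-tuple (a, b, c, d), e.g. (Athens, Greece, Oslo, Norway);
# it counts as correct iff the nearest word to b - a + c (excluding a, b, c) is d.
def analogy_accuracy(questions, vectors):
    correct = sum(1 for a, b, c, d in questions
                  if analogy(a, b, c, vectors) == d)
    return correct / len(questions)
```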
Linguistic Regularities - Evaluation
Linguistic Regularities - Examples
Visualization using PCA
Summary and discussion
• Word2vec: much faster and way more accurate than previous neural-net-based
solutions - the speed-up of training compared to the prior state of the art is
more than 10,000 times! (literally from weeks to seconds)
• Features derived from word2vec are now used at all the big IT companies
in plenty of applications (search, ads, ..)
• Also very popular in the research community: a simple way to boost
performance in many NLP tasks
• Main reasons for success: very fast, open-source, and the resulting features
are easy to use to boost many applications (even non-NLP ones)
Follow up work
Baroni, Dinu, Kruszewski (2014): Don't count, predict! A systematic
comparison of context-counting vs. context-predicting semantic vectors
• It turns out that neural-based approaches are very close to traditional
distributional semantics models
• Luckily, word2vec significantly outperformed the best previous
models across many tasks 
Follow up work
Pennington, Socher, Manning (2014): GloVe: Global Vectors for Word
Representation
• A word2vec-style model from Stanford: almost identical, but with a new name
• In some sense a step back: word2vec counts co-occurrences and does
dimensionality reduction together in one pass, while GloVe is a two-pass algorithm
Follow up work
Levy, Goldberg, Dagan (2015): Improving distributional similarity with
lessons learned from word embeddings
• Hyper-parameter tuning is important: this debunks the claims of GloVe's
superiority
• Compares models trained on the same data (unlike the GloVe paper…):
word2vec is faster, its vectors are better, and it uses much less memory
• Many others ended up with similar conclusions (Radim Rehurek, …)
Final notes
• Word2vec is successful because it is simple, but it cannot be applied
everywhere
• For modeling sequences of words, consider recurrent networks
• Do not sum word vectors to obtain representations of sentences; it will not
work well
• Be careful about the hype, as always … the most cited papers often contain
non-reproducible results
References
• Mikolov (2007): Language Modeling for Speech Recognition in Czech
• Collobert, Weston (2008): A unified architecture for natural language processing: Deep neural networks with
multitask learning
• Mikolov, Karafiat, Burget, Cernocky, Khudanpur (2010): Recurrent neural network based language model
• Mikolov (2012): Statistical Language Models Based on Neural Networks
• Mikolov, Yih, Zweig (2013): Linguistic Regularities in Continuous Space Word Representations
• Mikolov, Chen, Corrado, Dean (2013): Efficient estimation of word representations in vector space
• Mikolov, Sutskever, Chen, Corrado, Dean (2013): Distributed representations of words and phrases and their
compositionality
• Baroni, Dinu, Kruszewski (2014): Don't count, predict! A systematic comparison of context-counting vs.
context-predicting semantic vectors
• Pennington, Socher, Manning (2014): GloVe: Global Vectors for Word Representation
• Levy, Goldberg, Dagan (2015): Improving distributional similarity with lessons learned from word
embeddings
