Distributed Representations for
Natural Language Processing
Tomas Mikolov, Facebook
ML Prague 2016
Structure of this talk
• Motivation
• Word2vec
• Architecture
• Evaluation
• Examples
• Discussion
Motivation
Representation of text is very important for the performance of many real-world
applications: search, ads recommendation, ranking, spam filtering, …
• Local representations
• N-grams
• 1-of-N coding
• Bag-of-words
• Continuous representations
• Latent Semantic Analysis
• Latent Dirichlet Allocation
• Distributed Representations
Motivation: example
Suppose you want to quickly build a classifier:
• Input = keyword, or user query
• Output = is user interested in X? (where X can be a service, ad, …)
• Toy classifier: is X capital city?
• Getting training examples can be difficult, costly, and time consuming
• With local representations of input (1-of-N), one will need many
training examples for decent performance
Motivation: example
Suppose we have a few training examples:
• (Rome, 1)
• (Turkey, 0)
• (Prague, 1)
• (Australia, 0)
• …
Can we build a good classifier without much effort?
Motivation: example
Suppose we have a few training examples:
• (Rome, 1)
• (Turkey, 0)
• (Prague, 1)
• (Australia, 0)
• …
Can we build a good classifier without much effort?
YES, if we use good pre-trained features.
Motivation: example
Pre-trained features: leveraging the vast amounts of unannotated text data
• Local features:
• Prague = (0, 1, 0, 0, ..)
• Tokyo = (0, 0, 1, 0, ..)
• Italy = (1, 0, 0, 0, ..)
• Distributed features:
• Prague = (0.2, 0.4, 0.1, ..)
• Tokyo = (0.2, 0.4, 0.3, ..)
• Italy = (0.5, 0.8, 0.2, ..)
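To make the contrast concrete, here is a minimal sketch (not from the talk) of a nearest-centroid classifier built on distributed features. The 3-dimensional vectors and the is_capital helper are made-up toy values standing in for real pre-trained embeddings:

```python
# Toy illustration: why distributed features generalize from few examples.
# The 3-d vectors below are invented numbers, not real embeddings.
import numpy as np

features = {
    "Rome":      np.array([0.9, 0.1, 0.2]),
    "Prague":    np.array([0.8, 0.2, 0.1]),
    "Turkey":    np.array([0.1, 0.9, 0.3]),
    "Australia": np.array([0.2, 0.8, 0.4]),
    "Tokyo":     np.array([0.85, 0.15, 0.2]),   # never seen during training
}
train = [("Rome", 1), ("Turkey", 0), ("Prague", 1), ("Australia", 0)]

# Nearest-centroid classifier: average the feature vectors of each class.
pos = np.mean([features[w] for w, y in train if y == 1], axis=0)
neg = np.mean([features[w] for w, y in train if y == 0], axis=0)

def is_capital(word):
    v = features[word]
    return int(np.linalg.norm(v - pos) < np.linalg.norm(v - neg))

print(is_capital("Tokyo"))   # -> 1, because Tokyo lies close to Rome and Prague
# With 1-of-N inputs, every unseen word is equally far from both centroids,
# so no such generalization is possible.
```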
Distributed representations
• We hope to learn such representations so that Prague, Rome, Berlin,
Paris etc. will be close to each other
• We do not want just to cluster words: we seek representations that
can capture multiple degrees of similarity: Prague is similar to Berlin
in some way, and to Czech Republic in another way
• Can this even be done without manually created databases like
WordNet / knowledge graphs?
Word2vec
• Simple neural nets can be used to obtain distributed representations
of words (Hinton et al, 1986; Elman, 1991; …)
• The resulting representations have interesting structure – vectors can
be obtained using a shallow network (Mikolov, 2007)
Word2vec
• Deep learning for NLP (Collobert & Weston, 2008): let’s use deep
neural networks! It works great!
• Back to shallow nets: Word2vec toolkit (Mikolov et al, 2013) -> much
more efficient than deep networks for this task
Word2vec
Two basic architectures:
• Skip-gram
• CBOW
Two training objectives:
• Hierarchical softmax
• Negative sampling
Plus a bunch of tricks: weighting of distant words and down-sampling of frequent
words (sketched below)
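As an illustration of the last trick, here is a minimal sketch of down-sampling of frequent words, following the discard probability P(w) = 1 - sqrt(t / f(w)) from Mikolov et al. (2013). The toy corpus and the threshold value are assumptions chosen for the example:

```python
# Down-sampling of frequent words: each occurrence of word w is discarded with
# probability 1 - sqrt(t / f(w)), where f(w) is its relative frequency.
import random
from collections import Counter

corpus = "the cat sat on the mat the dog sat on the rug".split()
t = 0.05                      # toy threshold; the paper suggests ~1e-5 for large corpora
counts = Counter(corpus)
total = len(corpus)

def keep(word):
    f = counts[word] / total                      # relative frequency
    p_discard = max(0.0, 1.0 - (t / f) ** 0.5)    # frequent words -> higher p_discard
    return random.random() > p_discard

subsampled = [w for w in corpus if keep(w)]
print(subsampled)             # very frequent words like "the" are dropped most often
```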
Skip-gram Architecture
• Predicts the surrounding words given the current word
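A rough sketch of how (input word, word to predict) pairs can be generated for this objective. The per-position random shrinking of the window is one common way to implement the weighting of distant words; this is my reading of the approach, not code from the toolkit:

```python
# Generate skip-gram training pairs: the current word predicts each word
# within a randomly shrunk window around it.
import random

def skipgram_pairs(tokens, max_window=2):
    pairs = []
    for i, center in enumerate(tokens):
        b = random.randint(1, max_window)            # effective window for this position
        for j in range(max(0, i - b), min(len(tokens), i + b + 1)):
            if j != i:
                pairs.append((center, tokens[j]))    # (input word, word to predict)
    return pairs

print(skipgram_pairs("the quick brown fox jumps".split()))
```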
Continuous Bag-of-words Architecture
• Predicts the current word given the context
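A minimal forward-pass sketch of this idea, with a toy vocabulary and randomly initialized weight matrices standing in for trained ones:

```python
# CBOW forward pass: average the input vectors of the context words, then
# score every vocabulary word for the center position with a softmax.
import numpy as np

vocab = ["the", "quick", "brown", "fox", "jumps"]
idx = {w: i for i, w in enumerate(vocab)}
dim = 8
rng = np.random.default_rng(0)
W_in = rng.normal(size=(len(vocab), dim))     # input->hidden weights (word vectors)
W_out = rng.normal(size=(dim, len(vocab)))    # hidden->output weights

def cbow_predict(context_words):
    h = np.mean([W_in[idx[w]] for w in context_words], axis=0)   # average context
    scores = h @ W_out
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()                # distribution over the center word

print(cbow_predict(["the", "brown"]))         # training would push mass onto "quick"
```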
Word2vec: Linguistic Regularities
• After training is finished, the weight matrix between the input and hidden layers
represents the word feature vectors
• The word vector space implicitly encodes many regularities among words.
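In code, reading out a word vector is just a row lookup in that matrix, and similar words are found with cosine similarity. The matrix below is a random stand-in for a trained one:

```python
# Word vector of w = row w of the input-to-hidden weight matrix;
# nearest neighbours by cosine similarity.
import numpy as np

vocab = ["prague", "berlin", "paris", "dog", "cat"]
idx = {w: i for i, w in enumerate(vocab)}
W_in = np.random.default_rng(1).normal(size=(len(vocab), 8))   # stand-in for trained vectors

def most_similar(word, topn=3):
    W = W_in / np.linalg.norm(W_in, axis=1, keepdims=True)      # unit-length rows
    sims = W @ W[idx[word]]                                     # cosine similarities
    best = [vocab[i] for i in np.argsort(-sims) if vocab[i] != word]
    return best[:topn]

print(most_similar("prague"))   # with real trained vectors: other city names
```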
Linguistic Regularities in Word Vector Space
• The resulting distributed representations of words contain
a surprising amount of syntactic and semantic information
• There are multiple degrees of similarity among words:
• KING is similar to QUEEN as MAN is similar to WOMAN
• KING is similar to KINGS as MAN is similar to MEN
• Simple vector operations with the word vectors provide very intuitive
results (King – man + woman ~= Queen)
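A sketch of that vector arithmetic, assuming vectors is a word-to-vector dictionary taken from a trained model; the analogy helper and the cosine search are my own illustration of the idea:

```python
# Answer "a is to b as c is to ?" by finding the word closest to b - a + c.
import numpy as np

def analogy(a, b, c, vectors):
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target)
    best, best_sim = None, -1.0
    for w, v in vectors.items():
        if w in (a, b, c):                    # exclude the question words
            continue
        sim = float(v @ target / np.linalg.norm(v))
        if sim > best_sim:
            best, best_sim = w, sim
    return best

# usage: analogy("man", "king", "woman", vectors)  ->  ideally "queen"
```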
Linguistic Regularities - Evaluation
• Regularity of the learned word vector space was evaluated using a test
set with about 20K analogy questions (see the accuracy sketch below)
• The test set contains both syntactic and semantic questions
• Comparison to the previous state of the art (pre-2013)
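Accuracy on such a test set can be computed along these lines. This is a sketch reusing the analogy helper from the previous example; the question tuples and the vectors dictionary would come from the test file and a trained model:

```python
# Each question is a 4-tuple (a, b, c, d), e.g. (Athens, Greece, Oslo, Norway);
# it counts as correct iff the nearest word to b - a + c (excluding a, b, c) is d.
def analogy_accuracy(questions, vectors):
    correct = sum(1 for a, b, c, d in questions
                  if analogy(a, b, c, vectors) == d)
    return correct / len(questions)
```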
Linguistic Regularities - Evaluation
Linguistic Regularities - Examples
Visualization using PCA
Summary and discussion
• Word2vec: much faster and way more accurate than previous neural-net-based
solutions - the speed-up of training compared to the prior state of the art is
more than 10,000 times! (literally from weeks to seconds)
• Features derived from word2vec are now used at all the big IT companies
in plenty of applications (search, ads, ..)
• Also very popular in the research community: a simple way to boost
performance in many NLP tasks
• Main reasons for success: very fast, open-source, and the resulting features
are easy to use to boost many applications (even non-NLP ones)
Follow up work
Baroni, Dinu, Kruszewski (2014): Don't count, predict! A systematic
comparison of context-counting vs. context-predicting semantic vectors
• It turns out that neural-based approaches are very close to traditional
distributional semantics models
• Luckily, word2vec significantly outperformed the best previous
models across many tasks 
Follow up work
Pennington, Socher, Manning (2014): GloVe: Global Vectors for Word
Representation
• A word2vec-style model from Stanford: almost identical, but with a new name
• In some sense a step back: word2vec counts co-occurrences and does
dimensionality reduction together in one pass, while GloVe is a two-pass algorithm
Follow up work
Levy, Goldberg, Dagan (2015): Improving distributional similarity with
lessons learned from word embeddings
• Hyper-parameter tuning is important: this debunks the claims of GloVe's
superiority
• Compares models trained on the same data (unlike the GloVe paper…):
word2vec is faster, its vectors are better, and it uses much less memory
• Many others ended up with similar conclusions (Radim Rehurek, …)
Final notes
• Word2vec is successful because it is simple, but it cannot be applied
everywhere
• For modeling sequences of words, consider recurrent networks
• Do not sum word vectors to obtain representations of sentences; it will not
work well
• Be careful about the hype, as always … the most cited papers often contain
non-reproducible results
References
• Mikolov (2007): Language Modeling for Speech Recognition in Czech
• Collobert, Weston (2008): A unified architecture for natural language processing: Deep neural networks with
multitask learning
• Mikolov, Karafiat, Burget, Cernocky, Khudanpur (2010): Recurrent neural network based language model
• Mikolov (2012): Statistical Language Models Based on Neural Networks
• Mikolov, Yih, Zweig (2013): Linguistic Regularities in Continuous Space Word Representations
• Mikolov, Chen, Corrado, Dean (2013): Efficient estimation of word representations in vector space
• Mikolov, Sutskever, Chen, Corrado, Dean (2013): Distributed representations of words and phrases and their
compositionality
• Baroni, Dinu, Kruszewski (2014): Don't count, predict! A systematic comparison of context-counting vs.
context-predicting semantic vectors
• Pennington, Socher, Manning (2014): GloVe: Global Vectors for Word Representation
• Levy, Goldberg, Dagan (2015): Improving distributional similarity with lessons learned from word
embeddings
