Staying Shallow & Lean in a Deep Learning World
Xavier Amatriain (@xamat)
07/13/2016
Our Mission
“To share and grow the world’s knowledge”
• Millions of questions
• Millions of answers
• Millions of users
• Thousands of topics
• ...
Lots of high-quality textual information
Text + all those other things
What we care about
● Demand
● Quality
● Relevance
ML Applications
● Homepage feed ranking
● Email digest
● Answer quality & ranking
● Spam & harassment classification
● Topic/User recommendation
● Trending Topics
● Automated Topic Labelling
● Related & Duplicate Question
● User trustworthiness
● ...
(Diagram labels: click, upvote, downvote, expand, share)
Models
● Deep Neural Networks
● Logistic Regression
● Elastic Nets
● Gradient Boosted Decision Trees
● Random Forests
● LambdaMART
● Matrix Factorization
● LDA
● ...
Deep Learning Works
● Image Recognition
● Speech Recognition
● Natural Language Processing
● Game Playing
● Recommender Systems
But...
Deep Learning is not Magic
Deep Learning is not always that “accurate”
… or that “deep”
Other ML Advances
● Factorization Machines
● Tensor Methods
● Non-parametric Bayesian models
● XGBoost
● Online Learning
● Reinforcement Learning
● Learning to rank
● ...
Other very successful approaches
Is it bad to obsess over Deep Learning?
Some examples
Football or Futbol?
A real-life example
(Diagram: a single feature extraction algorithm producing a Label)
A real-life example: improved solution
(Diagram: other feature extraction algorithms feed into an Ensemble → Accuracy ++)
Another real example
● Goal: Supervised Classification
○ 40 features
○ 10k examples
● What did the ML Engineer choose?
○ A multi-layer ANN trained with TensorFlow
● What was his proposed next step?
○ Try ConvNets
● Where is the problem?
○ Hours to train, already looking into distributing
○ There are much simpler approaches (see the sketch below)
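To make “much simpler approaches” concrete, here is a minimal sketch of a gradient boosted tree baseline for a task of this shape. The data below is synthetic, standing in for the real 40-feature, 10k-example set, and the hyperparameters are scikit-learn defaults; it trains in seconds on a single machine:

```python
# A simple baseline for a 40-feature, 10k-example classification task.
# Synthetic data stands in for the real dataset; defaults everywhere.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=10000, n_features=40, random_state=0)

clf = GradientBoostingClassifier(random_state=0)
scores = cross_val_score(clf, X, y, cv=5)  # seconds on one core, no GPU
print("5-fold accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```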
Why DL is not the only/main solution
Occam’s Razor
● Given two models that perform more or less equally, you should always prefer the less complex one
● Deep Learning might not be preferred, even if it squeezes out +1% in accuracy
Occam’s razor: reasons to prefer a simpler model
● Accuracy is not the only criterion; there are many others:
○ System complexity
○ Maintenance
○ Explainability
○ …
No Free Lunch Theorem
“(...) any two optimization algorithms are equivalent when their performance is averaged across all possible problems.”
“if an algorithm performs well on a certain class of problems then it necessarily pays for that with degraded performance on the set of all remaining problems.”
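Formally, in the notation of Wolpert & Macready’s 1997 paper (where f ranges over all possible objective functions, d_m^y is the sequence of cost values seen after m evaluations, and a_1, a_2 are any two algorithms):

```latex
% No Free Lunch (Wolpert & Macready, 1997): averaged over all objective
% functions f, any two algorithms a_1 and a_2 induce the same distribution
% over the cost-value sequence d^y_m observed after m evaluations.
\[
  \sum_{f} P\bigl(d_m^{y} \mid f, m, a_1\bigr)
  \;=\;
  \sum_{f} P\bigl(d_m^{y} \mid f, m, a_2\bigr)
\]
```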
Feature Engineering
Need for feature engineering
In many cases an understanding of the domain will lead to optimal results.
Feature Engineering Example - Quora Answer Ranking
What is a good Quora answer?
• truthful
• reusable
• provides explanation
• well formatted
• ...
Feature Engineering Example - Quora Answer Ranking
How are those dimensions translated into features?
• Features that relate to the answer quality itself
• Interaction features (upvotes/downvotes, clicks, comments…)
• User features (e.g. expertise in topic)
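A hedged sketch of how those three groups might turn into a feature vector. Every field name and formula here is an illustrative assumption, not Quora’s actual feature set:

```python
# Hypothetical answer-ranking features; every field name and formula below
# is an illustrative assumption, not Quora's actual feature set.
def answer_features(answer, author, topic):
    votes = answer["upvotes"] + answer["downvotes"]
    return {
        # features about the answer text itself
        "length": len(answer["text"]),
        "has_formatting": int("\n" in answer["text"]),
        # interaction features
        "upvote_ratio": answer["upvotes"] / max(1, votes),
        "clicks": answer["clicks"],
        "comments": answer["comments"],
        # user features
        "author_topic_expertise": author["expertise"].get(topic, 0.0),
    }
```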
Feature Engineering
● Properties of a well-behaved ML feature:
○ Reusable
○ Transformable
○ Interpretable
○ Reliable
Deep Learning and Feature Engineering
Unsupervised Learning
● Unsupervised learning is a very important paradigm in theory and in practice
● So far, unsupervised learning has helped deep learning, but deep learning has not helped unsupervised learning
Supervised/Unsupervised Learning
● Unsupervised learning as dimensionality reduction
● Unsupervised learning as feature engineering
● The “magic” behind combining unsupervised/supervised learning (sketched below)
○ E.g. 1: clustering + kNN
○ E.g. 2: Matrix Factorization
■ MF can be interpreted as
● Unsupervised:
○ Dimensionality reduction à la PCA
○ Clustering (e.g. NMF)
● Supervised:
○ Labeled targets ~ regression
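A minimal sketch of the clustering + supervised pattern, under placeholder choices (synthetic data, k-means, a linear model): the unsupervised step’s output is appended as extra features before the supervised fit.

```python
# Unsupervised learning as feature engineering: append k-means centroid
# distances to the raw features before fitting a supervised model.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

km = KMeans(n_clusters=10, random_state=0).fit(X_tr)
X_tr_aug = np.hstack([X_tr, km.transform(X_tr)])  # distance to each centroid
X_te_aug = np.hstack([X_te, km.transform(X_te)])

clf = LogisticRegression(max_iter=1000).fit(X_tr_aug, y_tr)
print("held-out accuracy:", clf.score(X_te_aug, y_te))
```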
Ensembles
Even if all problems end up being suited for Deep Learning, there will always be a place for ensembles.
● Given the output of a Deep Learning prediction, you will be able to combine it with some other model or feature to improve the results.
Ensembles
● The Netflix Prize was won by an ensemble
○ Initially BellKor was using GBDTs
○ BigChaos introduced an ANN-based ensemble
● Most practical applications of ML run an ensemble
○ Why wouldn’t you?
○ At least as good as the best of your methods
○ Can add completely different approaches
Ensembles & Feature Engineering
● Ensembles are the way to turn any model into a feature!
● E.g. don’t know whether the way to go is Factorization Machines, Tensor Factorization, or RNNs?
○ Treat each model as a “feature”
○ Feed them into an ensemble (see the stacking sketch below)
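A sketch of the “model as a feature” idea via stacking in scikit-learn. The base estimators here are arbitrary stand-ins for the FM / tensor factorization / RNN candidates you can’t choose between:

```python
# Stacking: each base model's prediction becomes a feature for a final
# combiner, so you never have to pick a single winner up front.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

ensemble = StackingClassifier(
    estimators=[
        ("rf", RandomForestClassifier(random_state=0)),
        ("svm", SVC(probability=True, random_state=0)),
    ],
    final_estimator=LogisticRegression(),  # combines base-model outputs
)
ensemble.fit(X_tr, y_tr)
print("held-out accuracy:", ensemble.score(X_te, y_te))
```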
Distributing Algorithms
Distributing ML
● Most of what people do in practice can fit into a multi-core machine
○ Smart data sampling
○ Offline schemes
○ Efficient parallel code
● … but not Deep ANNs
● Do you care about costs? How about latencies or system complexity/debuggability?
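As an illustration of the single-machine point (model and data are placeholders), most scikit-learn estimators already parallelize across cores with one flag, no cluster required:

```python
# Single-machine parallelism: n_jobs=-1 trains the forest's trees on all
# available cores; no cluster, no GPU.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=100000, n_features=50, random_state=0)
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
clf.fit(X, y)
```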
Distributing ML
● That said…
● Deep Learning has managed to get away with its computational demands by promoting a “new paradigm” of parallel computing: GPUs
Conclusions
Conclusions
● Deep Learning has had some impressive results lately
● However, Deep Learning is not the only solution
○ It is dangerous to oversell Deep Learning
● Important to take other things into account
○ Other approaches/models
○ Feature Engineering
○ Unsupervised Learning
○ Ensembles
○ Need to distribute, costs, system complexity...
Questions?