Large-Scale Machine Learning:
Use Cases and Technologies
Andy Feng
VP Architecture, Yahoo
Agenda
§ Use Cases
§ Our Approach
§ ML/DL Algorithms
›  Word2Vec
›  Decision Trees
›  Deep Learning
§ Wrap-up
Use Case: Photos
§  Flickr
›  10 billion photos
•  7.5 million per day
›  No tags for most photos
§  Search: private/public
›  Computed tags
§  Technical capability
›  Scene detection
›  Object recognition
›  Face recognition
https://code.flickr.net/2014/10/15/ (100M Flickr photos/videos for R&D)
Use Case: Search
§  Intention understanding
§  Content ranking
§  Query-Ads matching
§  Ad click prediction
Use Case: Advertisement
§  Use case: Sponsored search
›  Target a user when his/her search queries contain a specific keyword
§  Use case: Search retargeting
›  Target users who have searched for a specific keyword
§  Use case: Email retargeting
›  Target users who have acted on emails containing a specific keyword
§  Problem
›  Exact keyword matching cannot reach enough users
§  Solution: keyword rewriting
›  keyword Q → a list of related keywords {Q1, …, Qn}
•  “purina one” → {“dog food”, “cat food”, “purina pro plan”}
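As an illustrative sketch of the rewriting step (not Yahoo's production pipeline), related keywords can be pulled from a trained word2vec model by nearest-neighbor search. gensim's KeyedVectors API is real; the model file and keyword preprocessing are assumptions:

```python
# Hypothetical sketch: rewrite an advertiser keyword Q into related
# keywords {Q1, ..., Qn} via nearest neighbors in embedding space.
from gensim.models import KeyedVectors

# Assumption: a word2vec model trained on query/ad data, with multi-word
# keywords joined by underscores (as in the training data shown later).
vectors = KeyedVectors.load_word2vec_format("query_ad_vectors.txt")

def rewrite_keyword(keyword, topn=5):
    """Return up to `topn` related keywords for a given keyword."""
    token = keyword.replace(" ", "_")
    if token not in vectors:
        return []  # fall back to exact keyword matching only
    return [w.replace("_", " ") for w, _ in vectors.most_similar(token, topn=topn)]

print(rewrite_keyword("purina one"))
# e.g. ['dog food', 'cat food', 'purina pro plan', ...]
```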
ML Challenge: Scale
1.  Massive amount of examples
›  Naïve solutions take days/weeks
2.  Billions of features
›  Model exceeds the memory limits of one computer
3.  Variety of algorithms
›  Different solutions required for scale-up
Image Classification: Deep Learning
§  ILSVRC
›  Classify images w.r.t. 1,000 categories
›  1.2 million images for training
›  50,000 images for validation
§  GoogLeNet: 2014 Winner
›  22-layer deep network
•  Convolution, Pooling, …, Loss
§  Computation
›  Trained on Google DistBelief cluster (16,000 CPU cores) → TensorFlow → TPU
›  47 days for 1 K40c GPU to achieve 88.9% top-5 accuracy (http://bit.ly/1FtFMa4)
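For reference, "top-5 accuracy" counts a prediction as correct when the true class is among the model's five highest-scoring classes. A minimal check of that metric, with stand-in random scores:

```python
import numpy as np

def top5_accuracy(scores, labels):
    """Fraction of images whose true label is among the 5 highest-scoring
    of the 1,000 classes -- the ILSVRC metric quoted above."""
    top5 = np.argsort(scores, axis=1)[:, -5:]      # indices of the 5 best classes
    return float(np.mean([l in row for l, row in zip(labels, top5)]))

scores = np.random.rand(50_000, 1000)              # stand-in validation scores
labels = np.random.randint(0, 1000, size=50_000)   # stand-in true classes
print(top5_accuracy(scores, labels))               # ~0.005 for random scores
```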
Turn Hadoop Clusters into ML Platform
[Diagram: 600 PB of HDFS storage across 40K computers, enhanced with machine learning]
Big-Data Cluster Enhanced
Big-ML Architecture
Examples of Big-ML Algorithms
1.  Word Embedding
›  CIKM 2016 paper
›  Business impact
2.  Decision Trees
›  NIPS 2016 paper
›  Academic collaboration
3.  Deep Learning
›  Open source (CaffeOnSpark and TensorFlowOnSpark)
›  Industry collaboration
Algorithm 1: Word2Vec (arXiv:1301.3781)
v(paris) = [0.13, -0.4, 0.22, …, -0.45]
v(lion) = [-0.23, -0.1, 0.98, …, 0.65]
v(quark) = [1.4, 0.32, -0.01, …, 0.023]
…
•  computes a vector for each word
•  captures word semantics
Word2Vec
*images: http://nlp.stanford.edu/projects/glove/
Similarity:
Similar words w1 & w2 have similar v(w1) and v(w2)
Relationships:
Well captured by the direction of v(w1) – v(w2)
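To make the two properties concrete, a toy numpy sketch with made-up 3-dimensional vectors (real embeddings use hundreds of dimensions; the numbers here are illustrative only):

```python
import numpy as np

# Illustrative 3-d stand-ins for real word embeddings.
v = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.8, 0.1, 0.6]),
    "man":   np.array([0.2, 0.7, 0.1]),
    "woman": np.array([0.2, 0.2, 0.6]),
}

def cosine(a, b):
    """Similarity: similar words have similar vectors (cosine near 1)."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(v["king"], v["queen"]))   # high: related words

# Relationships: the direction v(w1) - v(w2) captures the relation, so
# v(king) - v(man) points the same way as v(queen) - v(woman).
print(cosine(v["king"] - v["man"], v["queen"] - v["woman"]))  # near 1
```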
Query To Ad: Word2Vec Application
§  Semantic Matching of Query to Ads
§  Example sentence of training data
gas_cap_replacement_for_car
slc_679f037df54f5d9c41cab05bfae0926
gas_door_replacement_for_car
slc_466145af16a40717c84683db3f899d0a
fuel_door_covers
adid_c_28540527225_285898621262
slc_348709d73214fdeb9782f8b71aff7b6e
autozone_auto_parts
adid_b_3318310706_280452370893
auoto_zone
slc_8dcdab5d20a2caa02b8b1d1c8ccbd36b
Better Query Coverage With Larger Vocabulary
[Chart: query coverage (0%–70%) vs. vocabulary size (2M–128M words)]
Design Goals
§  Vocabulary size: 200 million; corpus size: 60 billion
›  500 GB of memory for 300-dimension vectors
§  Regular training on commodity hardware (128 GB memory, 10GbE, dual-socket servers)
›  Need a distributed system
§  Available solutions are insufficient
›  Google's open-source release, Spark MLlib, Deeplearning4j
›  All require that the vectors fit on a single machine
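A quick sanity check on the 500 GB figure, assuming 32-bit floats and word2vec's two vector tables (input and output vectors), which appears to be how the estimate is derived:

```python
# Back-of-the-envelope memory estimate for the vector tables.
vocab = 200_000_000      # 200 million words
dims = 300               # vector dimensions
bytes_per_float = 4      # 32-bit floats

one_table = vocab * dims * bytes_per_float   # input vectors: 240 GB
total = 2 * one_table                        # + output (context) vectors
print(total / 1e9)                           # ≈ 480 GB, i.e. roughly 500 GB
```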
Distributed Word2vec on Parameter Servers
[Architecture diagram]
›  Each PS shard stores a part of every vector (rows v1, v2, …, vn)
›  Word2vec learners read training data from HDFS and send word indices and seeds to the PS shards
›  Shards perform negative sampling and compute partial dot products uᵀv over their part of each vector
›  Learners aggregate the results and compute the global coefficients α, β
›  Shards update their vector parts (v += αu, …)
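A toy, single-process sketch of this column-sharded scheme (plain numpy, not the actual Yahoo implementation; the logistic update is a simplified stand-in for the full word2vec gradient):

```python
import numpy as np

DIMS, VOCAB, SHARDS = 300, 10_000, 4

class PSShard:
    """Holds one slice of the dimensions of *every* vector."""
    def __init__(self, dim_slice):
        n = dim_slice.stop - dim_slice.start
        self.vecs = np.random.randn(VOCAB, n) * 0.01

    def partial_dot(self, u_idx, v_idx):
        # Each shard computes u.v only over its own dimensions.
        return float(self.vecs[u_idx] @ self.vecs[v_idx])

    def update(self, u_idx, v_idx, alpha):
        # Apply v += alpha * u on this shard's slice of dimensions.
        self.vecs[v_idx] += alpha * self.vecs[u_idx]

shards = [PSShard(slice(i * DIMS // SHARDS, (i + 1) * DIMS // SHARDS))
          for i in range(SHARDS)]

def train_pair(u_idx, v_idx, label, lr=0.025):
    # Learner: aggregate partial dot products from all shards ...
    dot = sum(s.partial_dot(u_idx, v_idx) for s in shards)
    # ... compute the global coefficient alpha (simplified SGD step) ...
    alpha = lr * (label - 1.0 / (1.0 + np.exp(-dot)))
    # ... and send alpha back so every shard updates its slice.
    for s in shards:
        s.update(u_idx, v_idx, alpha)

train_pair(u_idx=42, v_idx=7, label=1.0)   # a positive (word, context) pair
```

Only word indices, seeds, and scalar coefficients cross the network; the 300-dimensional vectors never leave the shards, which is what lets the model exceed single-machine memory.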
Business Result: Sponsored Search (June 2016)
* https://yahooresearch.tumblr.com/post/146257394201
Algorithm 2: Decision Trees
Decision Trees: Basics
Distributed Training of Decision Trees
1)  Row-based partitioning
2)  Column-based partitioning
Yggdrasil: github.com/fabuzaid21/yggdrasil
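Yggdrasil's column-based partitioning lets each worker own whole feature columns and exchange only tiny best-split summaries. A minimal sketch of that idea, using variance reduction as an assumed split criterion:

```python
import numpy as np

def best_split_for_feature(x, y):
    """Worker-local: scan one feature column for the threshold that
    minimizes the summed label variance of the two child nodes."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = (np.inf, None)
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue
        score = ys[:i].var() * i + ys[i:].var() * (len(ys) - i)
        if score < best[0]:
            best = (score, (xs[i - 1] + xs[i]) / 2)
    return best

X = np.random.rand(1000, 8)
y = (X[:, 3] > 0.5).astype(float)          # label depends on feature 3
workers = [range(0, 4), range(4, 8)]       # feature indices per "worker"

# Each worker reports only its best (score, threshold, feature) -- a few
# bytes -- instead of shuffling rows, which is why column partitioning scales.
candidates = [min(best_split_for_feature(X[:, f], y) + (f,) for f in w)
              for w in workers]
score, threshold, feature = min(candidates)
print(f"split on feature {feature} at {threshold:.3f}")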
Gradient Boosted Decision Tree: 30x Speed-up
Algorithm 3: Deep Learning
[Diagram: forward pass and back-propagation through a neural network]
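A minimal numpy sketch of the two passes for one hidden layer (illustrative only; frameworks like Caffe and TensorFlow derive these gradients automatically):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))                  # input example
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 1))
target = np.array([[1.0]])

# Forward: propagate activations from input to loss.
h = np.maximum(0, x @ W1)                    # ReLU hidden layer
y = h @ W2                                   # output
loss = 0.5 * ((y - target) ** 2).sum()       # squared-error loss

# Back-propagation: propagate gradients from the loss back to each weight.
dy = y - target                              # dL/dy
dW2 = h.T @ dy                               # dL/dW2
dh = dy @ W2.T                               # dL/dh
dh[h <= 0] = 0                               # gradient through the ReLU
dW1 = x.T @ dh                               # dL/dW1

lr = 0.01                                    # gradient-descent step
W1 -= lr * dW1
W2 -= lr * dW2
```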
Open Sourced: Deep Learning Frameworks
github.com/yahoo/CaffeOnSpark
(since Feb 2016)
github.com/yahoo/TensorFlowOnSpark
(since Feb 2017)
CaffeOnSpark Architecture
TensorFlowOnSpark Architecture
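For a flavor of the API, a launch sketch adapted from TensorFlowOnSpark's published examples (the argument order follows those examples; the map-function body and the HDFS path are placeholders, so check the repo for current signatures):

```python
# Sketch of launching a TensorFlowOnSpark job on a Spark cluster.
from pyspark import SparkConf, SparkContext
from tensorflowonspark import TFCluster

def main_fun(args, ctx):
    # Runs on every executor; ctx describes this node's role
    # (parameter server or worker) in the TensorFlow cluster.
    import tensorflow as tf
    # ... build the graph and train on data fed in by Spark ...

sc = SparkContext(conf=SparkConf().setAppName("tfos_example"))
num_executors, num_ps = 4, 1

cluster = TFCluster.run(sc, main_fun, None, num_executors, num_ps,
                        False, TFCluster.InputMode.SPARK)
data_rdd = sc.textFile("hdfs:///path/to/training/data")  # placeholder path
cluster.train(data_rdd, 1)                               # 1 epoch
cluster.shutdown()
```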
CaffeOnSpark: 19x Speedup
§  1x1 GPU
›  39 hours … 60% top-5 accuracy
§  4x8 GPUs
›  10 hours … 80% top-5 accuracy
›  19x speedup estimated over 1x1 GPU
•  We are working on larger speedups
[Chart: top-5 validation error vs. training latency (hours)]
TensorFlowOnSpark: Near-Linear Scalability
DataWorks Summit 2017 Talks
▪  TensorFlowOnSpark (Lee, Andy)
• Tues 12:20pm, Ballroom B
▪  CaffeOnSpark (Mridul, Jun)
• Wed 12:20pm, 230A
Summary
§ Machine learning is critical for business
›  Search, advertisement, recommendation, security, etc.
§ Scalable ML platforms built on big-data clusters
›  Open source empowers collaboration
›  R&D opportunities for algorithm/system innovations
Join us for the journey: bigdata@yahoo-inc.com
