Large-Scale Machine Learning:
Use Cases and Technologies
Andy Feng
VP Architecture, Yahoo
Agenda
§ Use Cases
§ Our Approach
§ ML/DL Algorithms
›  Word2Vec
›  Decision Trees
›  Deep Learning
§ Wrap-up
Use Case: Photos
§  Flickr
›  10 billion photos
•  7.5 million per day
›  No tags for most photos
§  Search: private/public
›  Computed tags
§  Technical capability
›  Scene detection
›  Object recognition
›  Face recognition
https://code.flickr.net/2014/10/15/ (100M Flickr photos/videos for R&D)
Use Case: Search
§  Intention understanding
§  Content ranking
§  Query-Ads matching
§  Ad click prediction
Use Case: Advertisement
§  Use case: Sponsored search
›  Target a user when his/her search queries contain a specific keyword
§  Use case: Search retargeting
›  Target users who have searched for a specific keyword
§  Use case: Email retargeting
›  Target users who have acted on emails containing a specific keyword
§  Problem
›  Exact keyword matching cannot reach enough users
§  Solution: keyword rewriting
›  keyword Q → a list of related keywords {Q1, …, Qn}
•  “purina one” → {“dog food”, “cat food”, “purina pro plan”}
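As an illustrative sketch of the rewriting step (not Yahoo's production pipeline), related keywords can be pulled from a trained word2vec model by nearest-neighbor search. gensim's KeyedVectors API is real; the model file and keyword preprocessing are assumptions:

```python
# Hypothetical sketch: rewrite an advertiser keyword Q into related
# keywords {Q1, ..., Qn} via nearest neighbors in embedding space.
from gensim.models import KeyedVectors

# Assumption: a word2vec model trained on query/ad data, with multi-word
# keywords joined by underscores (as in the training data shown later).
vectors = KeyedVectors.load_word2vec_format("query_ad_vectors.txt")

def rewrite_keyword(keyword, topn=5):
    """Return up to `topn` related keywords for a given keyword."""
    token = keyword.replace(" ", "_")
    if token not in vectors:
        return []  # fall back to exact keyword matching only
    return [w.replace("_", " ") for w, _ in vectors.most_similar(token, topn=topn)]

print(rewrite_keyword("purina one"))
# e.g. ['dog food', 'cat food', 'purina pro plan', ...]
```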
ML Challenge: Scale
1.  Massive amount of examples
›  Naïve solutions take days/weeks
2.  Billions of features
›  Model exceeds the memory limits of one computer
3.  Variety of algorithms
›  Different solutions required for scale-up
Image Classification: Deep Learning
§  ILSVRC
›  Classify images w.r.t. 1,000 categories
›  1.2 million images for training
›  50,000 images for validation
§  GoogLeNet: 2014 Winner
›  22-layer deep network
•  Convolution, Pooling, …, Loss
§  Computation
›  Trained on Google DistBelief cluster (16,000 CPU cores) → TensorFlow → TPU
›  47 days for 1 K40c GPU to achieve 88.9% top-5 accuracy (http://bit.ly/1FtFMa4)
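For reference, "top-5 accuracy" counts a prediction as correct when the true class is among the model's five highest-scoring classes. A minimal check of that metric, with stand-in random scores:

```python
import numpy as np

def top5_accuracy(scores, labels):
    """Fraction of images whose true label is among the 5 highest-scoring
    of the 1,000 classes -- the ILSVRC metric quoted above."""
    top5 = np.argsort(scores, axis=1)[:, -5:]      # indices of the 5 best classes
    return float(np.mean([l in row for l, row in zip(labels, top5)]))

scores = np.random.rand(50_000, 1000)              # stand-in validation scores
labels = np.random.randint(0, 1000, size=50_000)   # stand-in true classes
print(top5_accuracy(scores, labels))               # ~0.005 for random scores
```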
Turn Hadoop Clusters into ML Platform
[Diagram: 600 PB of HDFS storage across 40K computers, enhanced with machine learning]
Big-Data Cluster Enhanced
Big-ML Architecture
Examples of Big-ML Algorithms
1.  Word Embedding
›  CIKM 2016 paper
›  Business impact
2.  Decision Trees
›  NIPS 2016 paper
›  Academic collaboration
3.  Deep Learning
›  Open source (CaffeOnSpark and TensorFlowOnSpark)
›  Industry collaboration
Algorithm 1: Word2Vec (arXiv:1301.3781)
v(paris) = [0.13, -0.4, 0.22, …, -0.45]
v(lion) = [-0.23, -0.1, 0.98, …, 0.65]
v(quark) = [1.4, 0.32, -0.01, …, 0.023]
…
•  computes a vector for each word
•  captures word semantics
Word2Vec
*images: http://nlp.stanford.edu/projects/glove/
Similarity:
Similar words w1 & w2 have similar v(w1) and v(w2)
Relationships:
Well captured by the direction of v(w1) – v(w2)
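To make the two properties concrete, a toy numpy sketch with made-up 3-dimensional vectors (real embeddings use hundreds of dimensions; the numbers here are illustrative only):

```python
import numpy as np

# Illustrative 3-d stand-ins for real word embeddings.
v = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.8, 0.1, 0.6]),
    "man":   np.array([0.2, 0.7, 0.1]),
    "woman": np.array([0.2, 0.2, 0.6]),
}

def cosine(a, b):
    """Similarity: similar words have similar vectors (cosine near 1)."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(v["king"], v["queen"]))   # high: related words

# Relationships: the direction v(w1) - v(w2) captures the relation, so
# v(king) - v(man) points the same way as v(queen) - v(woman).
print(cosine(v["king"] - v["man"], v["queen"] - v["woman"]))  # near 1
```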
Query To Ad: Word2Vec Application
§  Semantic Matching of Query to Ads
§  Example sentence of training data
gas_cap_replacement_for_car
slc_679f037df54f5d9c41cab05bfae0926
gas_door_replacement_for_car
slc_466145af16a40717c84683db3f899d0a
fuel_door_covers
adid_c_28540527225_285898621262
slc_348709d73214fdeb9782f8b71aff7b6e
autozone_auto_parts
adid_b_3318310706_280452370893
auoto_zone
slc_8dcdab5d20a2caa02b8b1d1c8ccbd36b
Better Query Coverage With Larger Vocabulary
[Chart: query coverage (0%–70%) vs. vocabulary size (2M–128M words)]
Design Goals
§  Vocabulary size: 200 million; corpus size: 60 billion
›  500 GB of memory for 300-dimension vectors
§  Regular training on commodity hardware (128 GB memory, 10GbE, dual-socket servers)
›  Need a distributed system
§  Available solutions are insufficient
›  Google's open-source release, Spark MLlib, Deeplearning4j
›  All require that the vectors fit on a single machine
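A quick sanity check on the 500 GB figure, assuming 32-bit floats and word2vec's two vector tables (input and output vectors), which appears to be how the estimate is derived:

```python
# Back-of-the-envelope memory estimate for the vector tables.
vocab = 200_000_000      # 200 million words
dims = 300               # vector dimensions
bytes_per_float = 4      # 32-bit floats

one_table = vocab * dims * bytes_per_float   # input vectors: 240 GB
total = 2 * one_table                        # + output (context) vectors
print(total / 1e9)                           # ≈ 480 GB, i.e. roughly 500 GB
```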
Distributed Word2vec on Parameter Servers
[Architecture diagram]
›  Each PS shard stores a part of every vector (rows v1, v2, …, vn)
›  Word2vec learners read training data from HDFS and send word indices and seeds to the PS shards
›  Shards perform negative sampling and compute partial dot products uᵀv over their part of each vector
›  Learners aggregate the results and compute the global coefficients α, β
›  Shards update their vector parts (v += αu, …)
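A toy, single-process sketch of this column-sharded scheme (plain numpy, not the actual Yahoo implementation; the logistic update is a simplified stand-in for the full word2vec gradient):

```python
import numpy as np

DIMS, VOCAB, SHARDS = 300, 10_000, 4

class PSShard:
    """Holds one slice of the dimensions of *every* vector."""
    def __init__(self, dim_slice):
        n = dim_slice.stop - dim_slice.start
        self.vecs = np.random.randn(VOCAB, n) * 0.01

    def partial_dot(self, u_idx, v_idx):
        # Each shard computes u.v only over its own dimensions.
        return float(self.vecs[u_idx] @ self.vecs[v_idx])

    def update(self, u_idx, v_idx, alpha):
        # Apply v += alpha * u on this shard's slice of dimensions.
        self.vecs[v_idx] += alpha * self.vecs[u_idx]

shards = [PSShard(slice(i * DIMS // SHARDS, (i + 1) * DIMS // SHARDS))
          for i in range(SHARDS)]

def train_pair(u_idx, v_idx, label, lr=0.025):
    # Learner: aggregate partial dot products from all shards ...
    dot = sum(s.partial_dot(u_idx, v_idx) for s in shards)
    # ... compute the global coefficient alpha (simplified SGD step) ...
    alpha = lr * (label - 1.0 / (1.0 + np.exp(-dot)))
    # ... and send alpha back so every shard updates its slice.
    for s in shards:
        s.update(u_idx, v_idx, alpha)

train_pair(u_idx=42, v_idx=7, label=1.0)   # a positive (word, context) pair
```

Only word indices, seeds, and scalar coefficients cross the network; the 300-dimensional vectors never leave the shards, which is what lets the model exceed single-machine memory.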
Business Result: Sponsored Search (June 2016)
* https://yahooresearch.tumblr.com/post/146257394201
Algorithm 2: Decision Trees
Decision Trees: Basics
Distributed Training of Decision Trees
1)  Row-based partitioning
2)  Column-based partitioning
Yggdrasil: github.com/fabuzaid21/yggdrasil
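Yggdrasil's column-based partitioning lets each worker own whole feature columns and exchange only tiny best-split summaries. A minimal sketch of that idea, using variance reduction as an assumed split criterion:

```python
import numpy as np

def best_split_for_feature(x, y):
    """Worker-local: scan one feature column for the threshold that
    minimizes the summed label variance of the two child nodes."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = (np.inf, None)
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue
        score = ys[:i].var() * i + ys[i:].var() * (len(ys) - i)
        if score < best[0]:
            best = (score, (xs[i - 1] + xs[i]) / 2)
    return best

X = np.random.rand(1000, 8)
y = (X[:, 3] > 0.5).astype(float)          # label depends on feature 3
workers = [range(0, 4), range(4, 8)]       # feature indices per "worker"

# Each worker reports only its best (score, threshold, feature) -- a few
# bytes -- instead of shuffling rows, which is why column partitioning scales.
candidates = [min(best_split_for_feature(X[:, f], y) + (f,) for f in w)
              for w in workers]
score, threshold, feature = min(candidates)
print(f"split on feature {feature} at {threshold:.3f}")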
Gradient Boosted Decision Tree: 30x Speed-up
Algorithm 3: Deep Learning
[Diagram: forward pass and back-propagation through a neural network]
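A minimal numpy sketch of the two passes for one hidden layer (illustrative only; frameworks like Caffe and TensorFlow derive these gradients automatically):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4))                  # input example
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 1))
target = np.array([[1.0]])

# Forward: propagate activations from input to loss.
h = np.maximum(0, x @ W1)                    # ReLU hidden layer
y = h @ W2                                   # output
loss = 0.5 * ((y - target) ** 2).sum()       # squared-error loss

# Back-propagation: propagate gradients from the loss back to each weight.
dy = y - target                              # dL/dy
dW2 = h.T @ dy                               # dL/dW2
dh = dy @ W2.T                               # dL/dh
dh[h <= 0] = 0                               # gradient through the ReLU
dW1 = x.T @ dh                               # dL/dW1

lr = 0.01                                    # gradient-descent step
W1 -= lr * dW1
W2 -= lr * dW2
```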
Open Sourced: Deep Learning Frameworks
github.com/yahoo/CaffeOnSpark
(since Feb 2016)
github.com/yahoo/TensorFlowOnSpark
(since Feb 2017)
CaffeOnSpark Architecture
TensorFlowOnSpark Architecture
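For a flavor of the API, a launch sketch adapted from TensorFlowOnSpark's published examples (the argument order follows those examples; the map-function body and the HDFS path are placeholders, so check the repo for current signatures):

```python
# Sketch of launching a TensorFlowOnSpark job on a Spark cluster.
from pyspark import SparkConf, SparkContext
from tensorflowonspark import TFCluster

def main_fun(args, ctx):
    # Runs on every executor; ctx describes this node's role
    # (parameter server or worker) in the TensorFlow cluster.
    import tensorflow as tf
    # ... build the graph and train on data fed in by Spark ...

sc = SparkContext(conf=SparkConf().setAppName("tfos_example"))
num_executors, num_ps = 4, 1

cluster = TFCluster.run(sc, main_fun, None, num_executors, num_ps,
                        False, TFCluster.InputMode.SPARK)
data_rdd = sc.textFile("hdfs:///path/to/training/data")  # placeholder path
cluster.train(data_rdd, 1)                               # 1 epoch
cluster.shutdown()
```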
CaffeOnSpark: 19x Speedup
§  1x1 GPU
›  39 hours … 60% top-5 accuracy
§  4x8 GPUs
›  10 hours … 80% top-5 accuracy
›  19x speedup estimated over 1x1 GPU
•  We are working on larger speedups
[Chart: top-5 validation error vs. training latency (hours)]
TensorFlowOnSpark: Near-Linear Scalability
DataWorks Summit 2017 Talks
▪  TensorFlowOnSpark (Lee, Andy)
• Tues 12:20pm, Ballroom B
▪  CaffeOnSpark (Mridul, Jun)
• Wed 12:20pm, 230A
Summary
§ Machine learning is critical for business
›  Search, advertisement, recommendation, security, etc.
§ Scalable ML platforms built on big-data clusters
›  Open source empowers collaboration
›  R&D opportunities for algorithm/system innovations
Join us for the journey: bigdata@yahoo-inc.com
