1 
Lessons Learned from Building Machine Learning Software at Netflix
Justin Basilico 
Page Algorithms Engineering December 13, 2014 
@JustinBasilico 
Workshop 2014
2 
Introduction
3 
Introduction 
[Screenshots: the Netflix homepage in 2006 vs. 2014]
4 
Netflix Scale 
• > 50M members
• > 40 countries
• > 1000 device types
• Hours: > 2B/month
• Plays: > 70M/day
• Log 100B events/day
• 34.2% of peak US downstream traffic
5 
Goal 
Help members find content to watch and enjoy, to maximize member satisfaction and retention
6 
Everything is a Recommendation 
[Homepage screenshot, annotated: selecting the rows and ranking the videos within each row are both recommendation problems]
Over 75% of what people watch comes from our recommendations.
Recommendations are driven by Machine Learning.
7 
Machine Learning Approach 
[Diagram linking Problem, Data, Metrics, Model, and Algorithm]
8 
Models & Algorithms 
• Regression (linear, logistic, elastic net)
• SVD and other matrix factorizations
• Factorization Machines
• Restricted Boltzmann Machines
• Deep Neural Networks
• Markov Models and graph algorithms
• Clustering
• Latent Dirichlet Allocation
• Gradient Boosted Decision Trees / Random Forests
• Gaussian Processes
• …
9 
Design Considerations 
Recommendations 
• Personal 
• Accurate 
• Diverse 
• Novel 
• Fresh 
Software 
• Scalable 
• Responsive 
• Resilient 
• Efficient 
• Flexible
10 
Software Stack 
http://techblog.netflix.com
11 
Lessons Learned
12 
Lesson 1: 
Be flexible about where and when computation happens.
13 
System Architecture 
• Offline: Process data
• Nearline: Process events
• Online: Process requests
• Learning, features, or model evaluation can be done at any level
[Architecture diagram: an OFFLINE layer (offline data, model training, offline computation), a NEARLINE layer (nearline computation fed by the user event queue and event distribution), and an ONLINE layer (online data service, online computation, and an algorithm service returning query results and recommendations to the UI client); models are published through Netflix.Hermes and Netflix.Manhattan; member actions (play, rate, browse...) enter the user event queue; machine learning algorithms run at every layer]
More details on Netflix Techblog
14 
Where to place components? 
• Example: Matrix Factorization, X ≈ UVᵀ
• Offline:
  • Collect sample of play data
  • Run a batch learning algorithm like SGD to produce the factorization
  • Publish video factors V
• Nearline (sketched in code below):
  • Solve user factors uᵢ (normal equations Auᵢ = b)
  • Compute user-video dot products sᵢⱼ = uᵢ·vⱼ
  • Store scores sᵢⱼ in cache
• Online:
  • Presentation-context filtering (e.g. keep only sᵢⱼ > t)
  • Serve recommendations
[Same system architecture diagram as before, annotated with where each step runs]
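To make the nearline step concrete, here is a minimal Python sketch, an illustration rather than Netflix's code, assuming numpy, ridge-regularized least squares, and toy data: it solves one user's factors against the published video factors V and computes the dot-product scores.

import numpy as np

def solve_user_factors(V, ratings, reg=0.1):
    # Solve Au = b with A = Vr'Vr + reg*I and b = Vr'r, where Vr holds the
    # factors of the videos this user rated (regularized least squares).
    items = list(ratings.keys())
    r = np.array([ratings[j] for j in items], dtype=float)
    Vr = V[items]
    A = Vr.T @ Vr + reg * np.eye(V.shape[1])
    b = Vr.T @ r
    return np.linalg.solve(A, b)

# Toy usage: 5 videos with 3-dimensional published factors, one user.
V = np.random.rand(5, 3)                      # published by the offline layer
u = solve_user_factors(V, {0: 4.0, 3: 1.0})   # solved nearline per user
scores = V @ u                                # s_ij = u_i . v_j, cached nearline
top = np.argsort(-scores)                     # online layer filters s_ij > t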
15 
Lesson 2: 
Think about distribution starting from the outermost levels.
16 
Three levels of Learning Distribution/Parallelization 
1. For each subset of the population (e.g. region)
  • Want independently trained and tuned models
2. For each combination of (hyper)parameters (see the sketch after this list)
  • Simple: Grid search
  • Better: Bayesian optimization using Gaussian Processes
3. For each subset of the training data
  • Distribute over machines (e.g. ADMM)
  • Multi-core parallelism (e.g. HogWild)
  • Or… use GPUs
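As an illustration of level 2, a minimal sketch (with a toy objective standing in for a real training job; nothing here is Netflix-specific) of farming independent hyperparameter combinations out to worker processes:

from concurrent.futures import ProcessPoolExecutor
from itertools import product

def train_and_validate(params):
    # Stand-in for a real training job: train a model with these
    # hyperparameters and return its validation metric.
    reg, lr = params
    return -((reg - 0.1) ** 2 + (lr - 0.01) ** 2)  # toy validation score

if __name__ == "__main__":
    # Every (hyper)parameter combination is independent, so grid search
    # parallelizes trivially over processes (or, at scale, machines).
    grid = list(product([0.01, 0.1, 1.0],      # regularization strengths
                        [0.001, 0.01, 0.1]))   # learning rates
    with ProcessPoolExecutor() as pool:
        scores = list(pool.map(train_and_validate, grid))
    best_score, best_params = max(zip(scores, grid))
    print(best_params, best_score)

Bayesian optimization replaces the fixed grid with a loop that proposes the next combination based on the scores seen so far, which is what tools like Spearmint and MOE provide.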
17 
Example: Training Neural Networks 
• Level 1: Machines in different AWS regions
• Level 2: Machines in the same AWS region
  • Spearmint or MOE for parameter optimization
  • Condor, StarCluster, Mesos, etc. for coordination
• Level 3: Highly optimized, parallel CUDA code on GPUs
18 
Lesson 3: 
Design application software for experimentation.
19 
Example development process 
[Flowchart: in the experimentation environment, an idea plus data feed offline modeling (R, Python, MATLAB, …), iterating to a final model; the model is then implemented in the production system (Java, C++, …) and its actual output is A/B tested in the production environment. Between the two environments lurk data discrepancies, code discrepancies, missing post-processing logic, and performance issues.]
20 
Avoid dual implementations 
[Diagram: instead of parallel experiment code and production code, both the experiment and production paths use a shared engine containing the models, features, algorithms, …]
21 
Solution: Share and lean towards production 
• Developing machine learning software is an iterative process
  • Want a short pipeline to rapidly try ideas
  • Want to see the output of the complete system, not just the learned component
• Make application components easy to experiment with
  • Share them between online, nearline, and offline (sketched below)
  • Make it possible to run individual parts of the software
• Use the real code whenever possible
  • Have well-defined interfaces and formats that let you go off the beaten path
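A hedged sketch of the sharing idea (class and function names are invented for illustration): define each component once, behind an interface, and let both the offline experiment harness and the online service call the same implementation.

from abc import ABC, abstractmethod

class Feature(ABC):
    # One implementation shared by experiments and production, so there
    # is a single definition to keep correct.
    @abstractmethod
    def compute(self, user, video) -> float: ...

class PopularityFeature(Feature):
    def __init__(self, play_counts):
        self.play_counts = play_counts

    def compute(self, user, video) -> float:
        return float(self.play_counts.get(video, 0))

def score(features, weights, user, video):
    # The same scoring path runs offline over logged data and online
    # per request; only the surrounding harness differs.
    return sum(w * f.compute(user, video) for w, f in zip(weights, features))

features = [PopularityFeature({"video_a": 10, "video_b": 3})]
print(score(features, [0.2], user="u1", video="video_a"))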
22 
Lesson 4: 
Make algorithms extensible and modular.
23 
Make algorithms and models extensible and modular
• Algorithms often need to be tailored for a specific application
  • Treating an algorithm as a black box is limiting
  • Better to make algorithms extensible and modular to allow for customization
• Separate models and algorithms (see the sketch below)
  • Many algorithms can learn the same model (e.g. a linear binary classifier)
  • Many algorithms can be trained on the same types of data
• Support composing algorithms
[Diagram: a black box taking Data and Parameters directly to a Model, vs. a design where Data and Parameters feed an Algorithm that produces a separate Model]
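A minimal sketch of the model/algorithm separation (illustrative names, not a specific library): two different training algorithms that both produce the same linear binary classifier model, so downstream code does not care which one was used.

import numpy as np

class LinearBinaryClassifier:
    # The model: weights, bias, and prediction only; no training logic.
    def __init__(self, w, b=0.0):
        self.w, self.b = np.asarray(w, dtype=float), float(b)

    def predict(self, x):
        return 1 if x @ self.w + self.b > 0 else -1

def train_perceptron(X, y, epochs=10):
    # One algorithm that learns a LinearBinaryClassifier.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:  # misclassified: update
                w, b = w + yi * xi, b + yi
    return LinearBinaryClassifier(w, b)

def train_logistic_sgd(X, y, lr=0.1, epochs=50):
    # A different algorithm producing the same kind of model.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            g = -yi / (1.0 + np.exp(yi * (xi @ w + b)))  # d loss / d score
            w, b = w - lr * g * xi, b - lr * g
    return LinearBinaryClassifier(w, b)

X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
print(train_perceptron(X, y).predict(np.array([1.5, 1.5])))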
24 
Provide building blocks 
• Don’t start from scratch
• Linear algebra: vectors, matrices, …
• Statistics: distributions, tests, …
• Models, features, metrics, ensembles, …
• Cost, distance, kernel, … functions
• Optimization, inference, …
• Layers, activation functions, …
• Initializers, stopping criteria, …
• …
• Domain-specific components
Build abstractions on familiar concepts. Make the software put them together.
25 
Example: Tailoring Random Forests 
[Code screenshot with callouts: use a custom tree split; customize it to run for an hour; report a custom metric each iteration; inspect the ensemble. A generic sketch follows below.]
Using Cognitive Foundry: http://github.com/algorithmfoundry/Foundry
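The Foundry's real API is at the link above; as a library-agnostic sketch of what such extension points can look like (all names hypothetical, tree induction stubbed out), here is a training loop that exposes the split rule, the stopping criterion, and a per-iteration hook:

import random
import time

def build_tree(sample, split_fn):
    # Stub standing in for real tree induction with the custom split rule.
    return {"n": len(sample), "split_rule": split_fn.__name__}

def train_forest(data, split_fn, stop_fn, on_iteration):
    # The loop exposes its extension points instead of hiding them.
    ensemble, i = [], 0
    while not stop_fn(i, ensemble):
        ensemble.append(build_tree(random.sample(data, len(data) // 2), split_fn))
        on_iteration(i, ensemble)  # e.g. report a custom metric each iteration
        i += 1
    return ensemble  # the caller can inspect the ensemble afterwards

def my_custom_split(sample):
    return sample  # placeholder for a custom tree split

deadline = time.time() + 3600  # "run it for an hour"...
forest = train_forest(
    data=list(range(1000)),
    split_fn=my_custom_split,
    stop_fn=lambda i, ens: time.time() > deadline or i >= 10,  # ...capped at 10 trees here
    on_iteration=lambda i, ens: print("iteration", i, "trees", len(ens)),
)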
26 
Lesson 5: 
Describe your input and output transformations with your model.
27 
Putting learning in an application 
[Diagram: the application passes input through Feature Encoding into a machine-learned model (ℝᵈ ⟶ ℝᵏ), then passes the result through Output Decoding; the open question is whether those transformations live in application code or model code]
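One way to make the answer "model code" is sketched below, an assumed, illustrative design rather than Netflix's: a wrapper that persists the encoder and decoder alongside the model, so every consumer applies the same transformations.

import numpy as np

class DeployableModel:
    # Bundle the raw R^d -> R^k model with the input encoding and output
    # decoding it was trained with, so they cannot drift apart.
    def __init__(self, encode, core_model, decode):
        self.encode = encode           # application objects -> vector in R^d
        self.core_model = core_model   # R^d -> R^k
        self.decode = decode           # vector in R^k -> application output

    def __call__(self, raw_input):
        return self.decode(self.core_model(self.encode(raw_input)))

# Toy usage: encode named features, score linearly, decode to a decision.
weights = np.array([0.2, 1.2])
model = DeployableModel(
    encode=lambda d: np.array([d["popularity"], d["predictedRating"]]),
    core_model=lambda x: np.array([x @ weights - 0.5]),
    decode=lambda y: "show" if y[0] > 0 else "hide",
)
print(model({"popularity": 2.0, "predictedRating": 3.0}))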
28 
Example: Simple ranking system 
• High-level API: List<Video> rank(User u, List<Video> videos)
• Example model description file:
{
  "type": "ScoringRanker",
  "scorer": {
    "type": "FeatureScorer",
    "features": [
      {"type": "Popularity", "days": 10},
      {"type": "PredictedRating"}
    ],
    "function": {
      "type": "Linear",
      "bias": -0.5,
      "weights": {
        "popularity": 0.2,
        "predictedRating": 1.2,
        "predictedRating*popularity": 3.5
      }
    }
  }
}
[Callouts on the file: the ranker, its scorer, the features, the linear function, and the feature transformations]
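A sketch of consuming such a file, in Python rather than the production Java; the feature classes and the product-key convention for weights like "predictedRating*popularity" are assumptions for illustration.

import json

class Popularity:
    name = "popularity"
    def __init__(self, days=30):
        self.days = days
    def compute(self, user, video):
        return video.get("popularity", 0.0)  # toy: would count recent plays

class PredictedRating:
    name = "predictedRating"
    def compute(self, user, video):
        return video.get("predictedRating", 0.0)

FEATURE_TYPES = {"Popularity": Popularity, "PredictedRating": PredictedRating}

def build_ranker(spec):
    scorer = spec["scorer"]
    features = [FEATURE_TYPES[f["type"]](**{k: v for k, v in f.items() if k != "type"})
                for f in scorer["features"]]
    fn = scorer["function"]  # assumes "type": "Linear"

    def score(user, video):
        values = {f.name: f.compute(user, video) for f in features}
        total = fn["bias"]
        for key, w in fn["weights"].items():
            prod = 1.0
            for part in key.split("*"):  # keys like "a*b" are feature products
                prod *= values[part]
            total += w * prod
        return total

    def rank(user, videos):
        return sorted(videos, key=lambda v: score(user, v), reverse=True)

    return rank

with open("ranker.json") as f:  # the description file shown above
    rank = build_ranker(json.load(f))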
29 
Lesson 6: 
Don’t just rely on metrics for testing.
30 
Importance of Testing 
• Temptation: Use validation metrics to test software
  • When things work this seems great
  • When metrics don’t improve: was it the code, data, metric, idea, …?
• Machine learning code involves intricate math and logic
  • Rounding issues, corner cases, …
  • Is that a + or a -? (The math or the paper could be wrong.)
• Solution: Unit test
  • Testing of metric code is especially important (a sketch follows below)
  • Test the whole system
  • Compare output across versions to catch unexpected changes
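A minimal example of the kind of unit test this argues for, using a hypothetical rank metric (recall@k, not one named in the deck) with hand-computed corner cases:

import unittest

def recall_at_k(ranked_items, relevant, k):
    # Fraction of the relevant items appearing in the top k.
    if not relevant:
        return 0.0  # corner case: nothing relevant
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / len(relevant)

class TestRecallAtK(unittest.TestCase):
    def test_hand_computed_example(self):
        self.assertAlmostEqual(recall_at_k(["a", "b", "c", "d"], {"a", "d"}, k=2), 0.5)

    def test_perfect_and_empty(self):
        self.assertEqual(recall_at_k(["a", "b"], {"a", "b"}, k=2), 1.0)
        self.assertEqual(recall_at_k(["a", "b"], set(), k=2), 0.0)

    def test_k_larger_than_list(self):
        self.assertEqual(recall_at_k(["a"], {"a", "b"}, k=10), 0.5)

if __name__ == "__main__":
    unittest.main()

Pinning metrics down this way makes it possible to tell "the idea didn't work" apart from "the measurement is wrong".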
31 
Conclusions
32 
Two ways to solve computational problems
• Software Development: Know solution → Write code → Compile code → Test code → Deploy code
• Machine Learning: Know relevant data → Develop algorithmic approach → Train model on data using algorithm → Validate model with metrics → Deploy model
(the Machine Learning steps may themselves involve Software Development)
33 
Take-aways for building machine learning software 
• Building machine learning software is an iterative process
• Make experimentation easy
• Take a holistic view of both the application and experimental environments
• Optimize only what matters
• Testing can be hard but is worthwhile
34
Thank You
Justin Basilico
jbasilico@netflix.com
@JustinBasilico
We’re hiring

Editor's Notes

  • System Architecture (slide 13): http://techblog.netflix.com/2013/03/system-architectures-for.html
  • Example: Training Neural Networks (slide 17): http://techblog.netflix.com/2014/02/distributed-neural-networks-with-gpus.html
  • We’re hiring (slide 34): http://jobs.netflix.com/jobs.php?id=NFX01267