A Missing Link in the ML Infrastructure Stack
Josh Tobin
Stealth Startup, UC Berkeley, Former OpenAI
Machine Learning is now a product engineering discipline
How did we get here?
ML analytics (2000s)
• Simple models run offline on medium-to-large datasets to produce reports
• Value comes from incorporating model insights into decisions
ML hype (2010s)
• Complicated models trained on massive datasets to produce papers
• Value comes from the marketing potential of high-profile research output
ML products (2020s?)
• Reproducibility, scalability, and maintainability over complexity
• Value comes from models improving the business’s products or services
ML products require a fundamentally new process
“Flat-earth” ML
[Diagram: Select problem → Collect data → Clean and label → Train → Report]
ML products require a fundamentally new process
ML Product Engineering
[Diagram: the flat-earth pipeline (Select problem → Collect data → Clean and label → Train → Report) extended with Test, Deploy, and Monitor stages, forming a loop]
ML teams that don’t make the transition die
What does it mean for you?
• Other disciplines will catch up to model training in prestige and pay
• The three Ps (papers, pie charts, PoCs) are no longer enough
Those that make the transition will create amazing things
• Autonomous vehicles
• Real-time translation
• Drug discovery
• Marketing automation
• Personalization
• Document understanding
• Etc.
Unlike flat-earth ML, ML products often:
• Run online and in real-time
• Deal with constantly evolving data distributions
• Handle messy, long-tail real-world data
• Make predictions autonomously or semi-autonomously
This implies new ops & infra demands
• Run online and in real-time → host and serve models with low latency (sketch below)
• Deal with constantly evolving data distributions → retrain models frequently, even continuously
• Handle messy, long-tail real-world data → inspect your data scalably, manage slices and edge cases
• Make predictions autonomously or semi-autonomously → quickly catch and diagnose bugs and distribution changes
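The deck does not prescribe tooling for any of these demands. As a minimal sketch of the first one, hosting a model behind a low-latency endpoint, here is what a bare-bones prediction service could look like; the framework (FastAPI), model path, and request schema are illustrative assumptions, not from the talk.

```python
# Minimal online prediction service (sketch). Assumes a scikit-learn model saved
# to MODEL_PATH; FastAPI/uvicorn are illustrative choices, not from the talk.
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

MODEL_PATH = "model.joblib"  # hypothetical artifact produced by the training pipeline

app = FastAPI()
model = joblib.load(MODEL_PATH)  # load once at startup, not per request

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest):
    # Single-row inference; batching, timeouts, and model versioning omitted.
    pred = model.predict([req.features])[0]
    return {"prediction": float(pred)}

# Run with: uvicorn <module_name>:app --host 0.0.0.0 --port 8080
```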
Is the infrastructure stack keeping up?
Train
• Existing tools: reproducible pipelines, training infrastructure, experiment management
Is the infrastructure stack keeping up?
Test
• Existing tools: CI/CD tools, explainability tools, model performance exploration
What’s still hard?
• Surfacing areas of poor performance
• Managing all your test cases
Is the infrastructure stack keeping up?
Deploy
• Existing tools: model serving, feature stores
What’s still hard?
• Experimentation (AB tests, shadow tests)
• Online / offline consistency
Is the infrastructure stack keeping up?
Monitor
• Existing tools: system monitoring, data quality / drift (e.g., Deequ)
What’s still hard?
• Performance monitoring
• Drift is still a bit of an art
Is the infrastructure stack keeping up?
Collect data
• Existing tools: data lakes, warehouses
What’s still hard?
• Subsampling data
• Connecting the data back to the model
Is the infrastructure stack keeping up?
Clean and label
• Existing tools: labeling tools & services, active learning tools
What’s still hard?
• What data should I label? (sketch below)
• What data should I train on?
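The deck leaves “what data should I label?” open; one common, simple answer from the active-learning toolbox is uncertainty sampling. A minimal sketch, assuming a scikit-learn-style classifier with predict_proba; the model, pool, and batch size are hypothetical.

```python
# Uncertainty sampling sketch: label the examples the current model is least sure
# about. Assumes a scikit-learn-style classifier; pool and batch size are hypothetical.
import numpy as np

def uncertainty_sample(model, unlabeled_pool: np.ndarray, batch_size: int = 100) -> np.ndarray:
    """Return indices of the `batch_size` least-confident examples in the pool."""
    probs = model.predict_proba(unlabeled_pool)   # shape: (n_examples, n_classes)
    confidence = probs.max(axis=1)                # top-class probability per example
    return np.argsort(confidence)[:batch_size]    # lowest confidence first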
Is the infrastructure stack keeping up?
Train (retraining)
What’s still hard?
• How do I know when to retrain?
• (Retraining online)
Takeaways
• Many tools emerging to address the problems of ML product engineering
• Problems arise at the boundaries of the tools, especially anything that shepherds data through the process
• At all stages, granular understanding of model performance is lacking
The Evaluation Store
A central place to store and query online and offline ground truth and approximate model quality metrics.
[Diagram: the Eval Store sits between Training, Evaluation, and Production. It holds data and prediction profiles, metric & slice definitions, and feedback on model predictions, and connects to the feature store and the model hub.]
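The talk does not pin down what the store actually holds. As a sketch under the assumptions above (profiles, metric and slice definitions, feedback on predictions), the core record types might look roughly like this; all class and field names are hypothetical.

```python
# Hypothetical record types an eval store might hold (names are illustrative,
# not from the talk): prediction events, metric definitions, and slice definitions.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, Optional

@dataclass
class PredictionRecord:
    model_id: str                      # which model made the prediction
    environment: str                   # "training", "evaluation", or "production"
    features: dict                     # input features (or a profile/hash of them)
    prediction: float
    timestamp: datetime
    label: Optional[float] = None      # ground truth, often delayed or missing
    approx_metrics: dict = field(default_factory=dict)  # e.g. drift or outlier scores

@dataclass
class MetricDefinition:
    name: str                                          # e.g. "accuracy"
    fn: Callable[[list, list], float]                  # (labels, predictions) -> value

@dataclass
class SliceDefinition:
    name: str                                          # e.g. "country=US, night-time"
    predicate: Callable[[dict], bool]                  # features -> include in slice?
```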
Querying the evaluation store
What form do queries take?
• A subset of models in the store
• A subset of metrics in the store
• A subset of slices in the store
• A specification of the window of data
For example:
• Monitoring: What is the importance-weighted average drift across all of my features in my production model in the last 60 minutes?
• Monitoring: How much worse is my accuracy in the last 7 days than it was during training?
• Testing: How do all of the metrics compare for model A and model B across all slices in my main evaluation set?
• AB testing: How do my business metrics compare for model A and model B in the last 60 minutes?
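A query is then just those four selectors bundled together. Below is a hedged sketch of what the first monitoring example might look like against a hypothetical client; EvalStoreClient and its query method are illustrative, not a real API.

```python
# Hypothetical query shape: models x metrics x slices x time window.
# EvalStoreClient and its query() method are illustrative, not a real API.
from datetime import timedelta

class EvalStoreClient:
    def query(self, models, metrics, slices, window):
        """Return metric values keyed by (model, metric, slice) over `window`."""
        # A real implementation would hit the store's metrics backend; stubbed here.
        return {(m, met, s): None for m in models for met in metrics for s in slices}

store = EvalStoreClient()

# "What is the importance-weighted average drift across all of my features
#  in my production model in the last 60 minutes?"
result = store.query(
    models=["my-model:production"],
    metrics=["importance_weighted_feature_drift"],
    slices=["all"],
    window=timedelta(minutes=60),
)
```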
A digression: approximate performance metrics
• In a perfect world, we would know right away how well the model performs on all data points seen in production
• In the real world, labels are unreliable, expensive, and delayed
• Approximate performance metrics are ways to guess which data points may have poor performance, e.g.:
  • Distribution distance between these data points and a reference distribution (sketch below)
  • Outlier detection
  • Weak supervision (à la Snorkel)
  • Metrics about your users (like engagement)
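As one concrete instance of the distribution-distance idea above, here is a population stability index between a reference sample and a production window of a single feature. The choice of PSI, the quantile binning, and the 0.2 rule of thumb are conventional assumptions; the talk only says “distribution distance”.

```python
# Population Stability Index (PSI) between a reference and a production sample
# of one feature: one possible "distribution distance" approximate metric.
# Binning scheme and threshold are conventional choices, not from the talk.
import numpy as np

def psi(reference: np.ndarray, production: np.ndarray, n_bins: int = 10) -> float:
    # Bin edges from the reference distribution (quantiles), shared by both samples.
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    prod_frac = np.histogram(production, bins=edges)[0] / len(production)
    # Avoid log(0) / division by zero with a small floor.
    ref_frac = np.clip(ref_frac, 1e-6, None)
    prod_frac = np.clip(prod_frac, 1e-6, None)
    return float(np.sum((prod_frac - ref_frac) * np.log(prod_frac / ref_frac)))

# Rule of thumb (convention, not from the talk): PSI > 0.2 suggests meaningful drift.
```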
The Evaluation Store
[Diagram: the Eval Store sits at the center of the product-engineering loop (Select problem → Collect data → Clean and label → Train → Test → Deploy → Monitor → Report), exchanging data with each stage.]
The Evaluation Store
Train
• Register the data distribution and performance for this model
• Warn us if training data looks too different from prod
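A sketch of what those training-stage hooks might look like in code. The eval_store client, its methods, the KS statistic, and the 0.1 threshold are all assumptions for illustration.

```python
# Training-stage hook (sketch): register the training data profile with the eval
# store and warn if it looks too different from recent production data.
# eval_store's methods and the 0.1 threshold are hypothetical.
import numpy as np
from scipy.stats import ks_2samp

def register_training_run(eval_store, model_id: str, train_df, feature_names: list[str]):
    # 1) Register a profile of the training data (per-feature summary stats).
    profile = {
        f: {"mean": float(np.mean(train_df[f])), "std": float(np.std(train_df[f]))}
        for f in feature_names
    }
    eval_store.register_profile(model_id=model_id, environment="training", profile=profile)

    # 2) Warn if training data looks too different from production.
    prod_df = eval_store.fetch_recent_features(environment="production", limit=10_000)
    for f in feature_names:
        result = ks_2samp(train_df[f], prod_df[f])   # two-sample KS distance
        if result.statistic > 0.1:                   # illustrative threshold
            print(f"WARNING: feature '{f}' differs between train and prod "
                  f"(KS={result.statistic:.2f})")
```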
The Evaluation Store
Test
• Register performance for this model on all test slices
• Pull historical data that has been flagged as “interesting” (e.g., gave another model trouble)
• Pull definitions of slices
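A sketch of how a test stage might combine those three steps: evaluate a candidate on every registered slice, compare against the current model, and write the results back. Every eval_store method shown, plus the 1% tolerance, is hypothetical.

```python
# Test-stage sketch: evaluate a candidate model on every registered slice and flag
# regressions against the current production model. All eval_store methods and the
# 1% tolerance are hypothetical.
from sklearn.metrics import accuracy_score

def run_slice_tests(eval_store, candidate, current_model_id: str):
    failures = []
    for slice_def in eval_store.get_slice_definitions():            # pull slice definitions
        X, y = eval_store.get_eval_data(slice=slice_def.name)       # incl. flagged "interesting" data
        candidate_score = accuracy_score(y, candidate.predict(X))
        baseline_score = eval_store.get_metric(                     # current model on this slice
            model_id=current_model_id, metric="accuracy", slice=slice_def.name)
        eval_store.register_metric(                                 # record the candidate's score
            model_id="candidate", metric="accuracy",
            slice=slice_def.name, value=candidate_score)
        if candidate_score < baseline_score - 0.01:                 # allow a 1% tolerance
            failures.append((slice_def.name, baseline_score, candidate_score))
    return failures  # non-empty list => block the deploy
```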
The Evaluation Store
Deploy
• Run a shadow test or AB test by pulling the diff in model performance between versions
• Log data and approximate performance back to the eval store
The Evaluation Store
Monitor
• Fire an alert when approximate performance on any of our slices dips below a threshold
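A sketch of that monitoring check: compare an approximate performance metric per slice against a threshold over a recent window and fire an alert. The client method, metric name, window, and alert channel are all hypothetical.

```python
# Monitor-stage sketch: alert when approximate performance on any slice dips below
# a threshold. eval_store.latest_metric(), the metric name, and thresholds are hypothetical.
from datetime import timedelta

def check_slices(eval_store, model_id: str, thresholds: dict, alert=print):
    window = timedelta(minutes=60)
    for slice_name, threshold in thresholds.items():
        value = eval_store.latest_metric(
            model_id=model_id, metric="approx_accuracy",
            slice=slice_name, window=window)
        if value is not None and value < threshold:
            alert(f"{model_id}: approx accuracy on slice '{slice_name}' "
                  f"dropped to {value:.3f} (threshold {threshold})")
```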
The Evaluation Store
Collect data
• Log more data with low or uncertain approximate performance
The Evaluation Store
Clean and label
• Inspect & label data with low approximate performance
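A sketch of how the clean-and-label stage might pull its work from the store: take production records whose approximate performance score is low and send them to a labeling queue. The client methods, the approx_score field, and the cutoff are hypothetical.

```python
# Clean-and-label-stage sketch: route production data with low approximate performance
# to human labelers. Client methods, the "approx_score" field, and the cutoff are hypothetical.
def queue_low_performing_data(eval_store, labeling_queue, model_id: str,
                              score_cutoff: float = 0.5, max_items: int = 500):
    records = eval_store.fetch_predictions(
        model_id=model_id, environment="production", unlabeled_only=True)
    # Lowest approximate-performance first: most likely to be wrong or interesting.
    records.sort(key=lambda r: r.approx_metrics.get("approx_score", 1.0))
    for record in records[:max_items]:
        if record.approx_metrics.get("approx_score", 1.0) < score_cutoff:
            labeling_queue.add(record)   # human labelers pick these up
```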
The Evaluation Store
Train (retrain)
• Retrain when approximate performance dips below a threshold
What could an eval store help you with?
• Reduce organizational friction. Get stakeholders (ML eng, ML research, PM, MLOps, etc.) on the same page about metric and slice definitions
• Deploy models more confidently. Evaluate metrics and slices consistently in testing and prod. Make the metrics visible to stakeholders
• Catch production bugs faster. Catch degradations across any slice, and drill down to the data that caused the degradation
• Reduce data-related costs. Collect and label production data more intelligently
• Make your model better. Decide when to retrain. Pick the right data to retrain on.
Shouldn’t the feature store do this?
• The feature store is indexed by feature; the eval store is indexed by model
• A model taking a feature as input doesn’t mean that it looks at the entire distribution
• A “poor quality” feature has different effects on different models
• Not all data will come through the feature store
• The two should talk to each other!
Wait, isn’t this just ML monitoring?
• Yes
• The hard part here is approximating how well your model might be performing right now
• That’s ML monitoring
Wait, isn’t this just ML monitoring?
• No
• The eval store should provide a consistent view of online and offline performance
• The eval store is tightly integrated into the entire MLOps stack
• The eval store keeps track of what data caused questionable performance, so it can be used for testing and retraining
ML monitoring
[Diagram: Training feeds Evaluation offline; Monitoring watches Production separately.]
Eval store
[Diagram: a single Eval store spans Training, Evaluation, and Production.]
Case study 1: the Tesla data engine
youtube.com/watch?t=7714&v=Ucp0TTmvqOE
Case study 2: TFX data validation
https://mlsys.org/Conferences/2019/doc/2019/167.pdf
Case study 3: Overton (Apple)
https://machinelearning.apple.com/research/overton
A Missing Link in the ML Infra Stack?
• To turn ML into a product engineering discipline, we need an infrastructure stack that helps create a data flywheel
• What’s still missing?
  • Granular, online-offline understanding of model performance
  • Orchestrating data and models throughout the whole loop
• Maybe the Evaluation Store could help
