A Missing Link in the ML Infrastructure Stack
Josh Tobin
Stealth Startup, UC Berkeley, Former OpenAI
Machine Learning is now a product engineering discipline
How did we get here?
ML analytics (2000s)
• Simple models run offline on medium-to-large datasets to produce reports
• Value comes from incorporating model insights into decisions
ML hype (2010s)
• Complicated models trained on massive datasets to produce papers
• Value comes from the marketing potential of high-profile research output
ML products (2020s?)
• Reproducibility, scalability, and maintainability over complexity
• Value comes from models improving the business’s products or services
ML products require a fundamentally new process
“Flat-earth” ML
[Diagram: Select problem → Collect data → Clean and label → Train → Report]
ML products require a fundamentally new process
ML Product Engineering
[Diagram: the flat-earth pipeline (Select problem → Collect data → Clean and label → Train → Report) extended with Test, Deploy, and Monitor stages, forming a loop]
ML teams that don’t make the transition die
What does it mean for you?
• Other disciplines will catch up to model training in prestige and pay
• The three Ps (papers, pie charts, PoCs) are no longer enough
Those that make the transition will create amazing things
• Autonomous vehicles
• Real-time translation
• Drug discovery
• Marketing automation
• Personalization
• Document understanding
• Etc.
Unlike flat-earth ML, ML products often:
• Run online and in real-time
• Deal with constantly evolving data distributions
• Handle messy, long-tail real-world data
• Make predictions autonomously or semi-autonomously
This implies new ops & infra demands
• Run online and in real-time → host and serve models with low latency (sketch below)
• Deal with constantly evolving data distributions → retrain models frequently, even continuously
• Handle messy, long-tail real-world data → inspect your data scalably, manage slices and edge cases
• Make predictions autonomously or semi-autonomously → quickly catch and diagnose bugs and distribution changes
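The deck does not prescribe tooling for any of these demands. As a minimal sketch of the first one, hosting a model behind a low-latency endpoint, here is what a bare-bones prediction service could look like; the framework (FastAPI), model path, and request schema are illustrative assumptions, not from the talk.

```python
# Minimal online prediction service (sketch). Assumes a scikit-learn model saved
# to MODEL_PATH; FastAPI/uvicorn are illustrative choices, not from the talk.
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

MODEL_PATH = "model.joblib"  # hypothetical artifact produced by the training pipeline

app = FastAPI()
model = joblib.load(MODEL_PATH)  # load once at startup, not per request

class PredictRequest(BaseModel):
    features: list[float]

@app.post("/predict")
def predict(req: PredictRequest):
    # Single-row inference; batching, timeouts, and model versioning omitted.
    pred = model.predict([req.features])[0]
    return {"prediction": float(pred)}

# Run with: uvicorn <module_name>:app --host 0.0.0.0 --port 8080
```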
Is the infrastructure stack keeping up?
Train
• Existing tools: reproducible pipelines, training infrastructure, experiment management
Is the infrastructure stack keeping up?
Test
• Existing tools: CI/CD tools, explainability tools, model performance exploration
What’s still hard?
• Surfacing areas of poor performance
• Managing all your test cases
Is the infrastructure stack keeping up?
Deploy
• Existing tools: model serving, feature stores
What’s still hard?
• Experimentation (AB tests, shadow tests)
• Online / offline consistency
Is the infrastructure stack keeping up?
Monitor
• Existing tools: system monitoring, data quality / drift (e.g., Deequ)
What’s still hard?
• Performance monitoring
• Drift is still a bit of an art
Is the infrastructure stack keeping up?
Collect data
• Existing tools: data lakes, warehouses
What’s still hard?
• Subsampling data
• Connecting the data back to the model
Is the infrastructure stack keeping up?
Clean and label
• Existing tools: labeling tools & services, active learning tools
What’s still hard?
• What data should I label? (sketch below)
• What data should I train on?
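The deck leaves “what data should I label?” open; one common, simple answer from the active-learning toolbox is uncertainty sampling. A minimal sketch, assuming a scikit-learn-style classifier with predict_proba; the model, pool, and batch size are hypothetical.

```python
# Uncertainty sampling sketch: label the examples the current model is least sure
# about. Assumes a scikit-learn-style classifier; pool and batch size are hypothetical.
import numpy as np

def uncertainty_sample(model, unlabeled_pool: np.ndarray, batch_size: int = 100) -> np.ndarray:
    """Return indices of the `batch_size` least-confident examples in the pool."""
    probs = model.predict_proba(unlabeled_pool)   # shape: (n_examples, n_classes)
    confidence = probs.max(axis=1)                # top-class probability per example
    return np.argsort(confidence)[:batch_size]    # lowest confidence first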
Is the infrastructure stack keeping up?
Train (retraining)
What’s still hard?
• How do I know when to retrain?
• (Retraining online)
Takeaways
• Many tools emerging to address the problems of ML product engineering
• Problems arise at the boundaries of the tools, especially anything that shepherds data through the process
• At all stages, granular understanding of model performance is lacking
The Evaluation Store
A central place to store and query online and offline ground truth and approximate model quality metrics.
[Diagram: the Eval Store sits between Training, Evaluation, and Production. It holds data and prediction profiles, metric & slice definitions, and feedback on model predictions, and connects to the feature store and the model hub.]
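The talk does not pin down what the store actually holds. As a sketch under the assumptions above (profiles, metric and slice definitions, feedback on predictions), the core record types might look roughly like this; all class and field names are hypothetical.

```python
# Hypothetical record types an eval store might hold (names are illustrative,
# not from the talk): prediction events, metric definitions, and slice definitions.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Callable, Optional

@dataclass
class PredictionRecord:
    model_id: str                      # which model made the prediction
    environment: str                   # "training", "evaluation", or "production"
    features: dict                     # input features (or a profile/hash of them)
    prediction: float
    timestamp: datetime
    label: Optional[float] = None      # ground truth, often delayed or missing
    approx_metrics: dict = field(default_factory=dict)  # e.g. drift or outlier scores

@dataclass
class MetricDefinition:
    name: str                                          # e.g. "accuracy"
    fn: Callable[[list, list], float]                  # (labels, predictions) -> value

@dataclass
class SliceDefinition:
    name: str                                          # e.g. "country=US, night-time"
    predicate: Callable[[dict], bool]                  # features -> include in slice?
```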
Querying the evaluation store
What form do queries take?
• A subset of models in the store
• A subset of metrics in the store
• A subset of slices in the store
• A specification of the window of data
For example:
• Monitoring: What is the importance-weighted average drift across all of my features in my production model in the last 60 minutes?
• Monitoring: How much worse is my accuracy in the last 7 days than it was during training?
• Testing: How do all of the metrics compare for model A and model B across all slices in my main evaluation set?
• AB testing: How do my business metrics compare for model A and model B in the last 60 minutes?
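A query is then just those four selectors bundled together. Below is a hedged sketch of what the first monitoring example might look like against a hypothetical client; EvalStoreClient and its query method are illustrative, not a real API.

```python
# Hypothetical query shape: models x metrics x slices x time window.
# EvalStoreClient and its query() method are illustrative, not a real API.
from datetime import timedelta

class EvalStoreClient:
    def query(self, models, metrics, slices, window):
        """Return metric values keyed by (model, metric, slice) over `window`."""
        # A real implementation would hit the store's metrics backend; stubbed here.
        return {(m, met, s): None for m in models for met in metrics for s in slices}

store = EvalStoreClient()

# "What is the importance-weighted average drift across all of my features
#  in my production model in the last 60 minutes?"
result = store.query(
    models=["my-model:production"],
    metrics=["importance_weighted_feature_drift"],
    slices=["all"],
    window=timedelta(minutes=60),
)
```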
A digression: approximate performance metrics
• In a perfect world, we would know right away how well the model performs on all data points seen in production
• In the real world, labels are unreliable, expensive, and delayed
• Approximate performance metrics are ways to guess which data points may have poor performance, e.g.:
  • Distribution distance between these data points and a reference distribution (sketch below)
  • Outlier detection
  • Weak supervision (à la Snorkel)
  • Metrics about your users (like engagement)
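As one concrete instance of the distribution-distance idea above, here is a population stability index between a reference sample and a production window of a single feature. The choice of PSI, the quantile binning, and the 0.2 rule of thumb are conventional assumptions; the talk only says “distribution distance”.

```python
# Population Stability Index (PSI) between a reference and a production sample
# of one feature: one possible "distribution distance" approximate metric.
# Binning scheme and threshold are conventional choices, not from the talk.
import numpy as np

def psi(reference: np.ndarray, production: np.ndarray, n_bins: int = 10) -> float:
    # Bin edges from the reference distribution (quantiles), shared by both samples.
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    prod_frac = np.histogram(production, bins=edges)[0] / len(production)
    # Avoid log(0) / division by zero with a small floor.
    ref_frac = np.clip(ref_frac, 1e-6, None)
    prod_frac = np.clip(prod_frac, 1e-6, None)
    return float(np.sum((prod_frac - ref_frac) * np.log(prod_frac / ref_frac)))

# Rule of thumb (convention, not from the talk): PSI > 0.2 suggests meaningful drift.
```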
The Evaluation Store
[Diagram: the Eval Store sits at the center of the product-engineering loop (Select problem → Collect data → Clean and label → Train → Test → Deploy → Monitor → Report), exchanging data with each stage.]
The Evaluation Store
Train
• Register the data distribution and performance for this model
• Warn us if training data looks too different from prod
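A sketch of what those training-stage hooks might look like in code. The eval_store client, its methods, the KS statistic, and the 0.1 threshold are all assumptions for illustration.

```python
# Training-stage hook (sketch): register the training data profile with the eval
# store and warn if it looks too different from recent production data.
# eval_store's methods and the 0.1 threshold are hypothetical.
import numpy as np
from scipy.stats import ks_2samp

def register_training_run(eval_store, model_id: str, train_df, feature_names: list[str]):
    # 1) Register a profile of the training data (per-feature summary stats).
    profile = {
        f: {"mean": float(np.mean(train_df[f])), "std": float(np.std(train_df[f]))}
        for f in feature_names
    }
    eval_store.register_profile(model_id=model_id, environment="training", profile=profile)

    # 2) Warn if training data looks too different from production.
    prod_df = eval_store.fetch_recent_features(environment="production", limit=10_000)
    for f in feature_names:
        result = ks_2samp(train_df[f], prod_df[f])   # two-sample KS distance
        if result.statistic > 0.1:                   # illustrative threshold
            print(f"WARNING: feature '{f}' differs between train and prod "
                  f"(KS={result.statistic:.2f})")
```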
The Evaluation Store
Test
• Register performance for this model on all test slices
• Pull historical data that has been flagged as “interesting” (e.g., gave another model trouble)
• Pull definitions of slices
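A sketch of how a test stage might combine those three steps: evaluate a candidate on every registered slice, compare against the current model, and write the results back. Every eval_store method shown, plus the 1% tolerance, is hypothetical.

```python
# Test-stage sketch: evaluate a candidate model on every registered slice and flag
# regressions against the current production model. All eval_store methods and the
# 1% tolerance are hypothetical.
from sklearn.metrics import accuracy_score

def run_slice_tests(eval_store, candidate, current_model_id: str):
    failures = []
    for slice_def in eval_store.get_slice_definitions():            # pull slice definitions
        X, y = eval_store.get_eval_data(slice=slice_def.name)       # incl. flagged "interesting" data
        candidate_score = accuracy_score(y, candidate.predict(X))
        baseline_score = eval_store.get_metric(                     # current model on this slice
            model_id=current_model_id, metric="accuracy", slice=slice_def.name)
        eval_store.register_metric(                                 # record the candidate's score
            model_id="candidate", metric="accuracy",
            slice=slice_def.name, value=candidate_score)
        if candidate_score < baseline_score - 0.01:                 # allow a 1% tolerance
            failures.append((slice_def.name, baseline_score, candidate_score))
    return failures  # non-empty list => block the deploy
```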
The Evaluation Store
Deploy
• Run a shadow test or AB test by pulling the diff in model performance between versions
• Log data and approximate performance back to the eval store
The Evaluation Store
Monitor
• Fire an alert when approximate performance on any of our slices dips below a threshold
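A sketch of that monitoring check: compare an approximate performance metric per slice against a threshold over a recent window and fire an alert. The client method, metric name, window, and alert channel are all hypothetical.

```python
# Monitor-stage sketch: alert when approximate performance on any slice dips below
# a threshold. eval_store.latest_metric(), the metric name, and thresholds are hypothetical.
from datetime import timedelta

def check_slices(eval_store, model_id: str, thresholds: dict, alert=print):
    window = timedelta(minutes=60)
    for slice_name, threshold in thresholds.items():
        value = eval_store.latest_metric(
            model_id=model_id, metric="approx_accuracy",
            slice=slice_name, window=window)
        if value is not None and value < threshold:
            alert(f"{model_id}: approx accuracy on slice '{slice_name}' "
                  f"dropped to {value:.3f} (threshold {threshold})")
```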
The Evaluation Store
Collect data
• Log more data with low or uncertain approximate performance
The Evaluation Store
Clean and label
• Inspect & label data with low approximate performance
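A sketch of how the clean-and-label stage might pull its work from the store: take production records whose approximate performance score is low and send them to a labeling queue. The client methods, the approx_score field, and the cutoff are hypothetical.

```python
# Clean-and-label-stage sketch: route production data with low approximate performance
# to human labelers. Client methods, the "approx_score" field, and the cutoff are hypothetical.
def queue_low_performing_data(eval_store, labeling_queue, model_id: str,
                              score_cutoff: float = 0.5, max_items: int = 500):
    records = eval_store.fetch_predictions(
        model_id=model_id, environment="production", unlabeled_only=True)
    # Lowest approximate-performance first: most likely to be wrong or interesting.
    records.sort(key=lambda r: r.approx_metrics.get("approx_score", 1.0))
    for record in records[:max_items]:
        if record.approx_metrics.get("approx_score", 1.0) < score_cutoff:
            labeling_queue.add(record)   # human labelers pick these up
```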
The Evaluation Store
Train (retrain)
• Retrain when approximate performance dips below a threshold
What could an eval store help you with?
• Reduce organizational friction. Get stakeholders (ML eng, ML research, PM, MLOps, etc.) on the same page about metric and slice definitions
• Deploy models more confidently. Evaluate metrics and slices consistently in testing and prod. Make the metrics visible to stakeholders
• Catch production bugs faster. Catch degradations across any slice, and drill down to the data that caused the degradation
• Reduce data-related costs. Collect and label production data more intelligently
• Make your model better. Decide when to retrain. Pick the right data to retrain on.
Shouldn’t the feature store do this?
• The feature store is indexed by feature; the eval store is indexed by model
• A model taking a feature as input doesn’t mean that it looks at the entire distribution
• A “poor quality” feature has different effects on different models
• Not all data will come through the feature store
• The two should talk to each other!
Wait, isn’t this just ML monitoring?
• Yes
• The hard part here is approximating how well your model might be performing right now
• That’s ML monitoring
Wait, isn’t this just ML monitoring?
• No
• The eval store should provide a consistent view of online and offline performance
• The eval store is tightly integrated into the entire MLOps stack
• The eval store keeps track of what data caused questionable performance, so it can be used for testing and retraining
ML monitoring
[Diagram: Training feeds Evaluation offline; Monitoring watches Production separately.]
Eval store
[Diagram: a single Eval store spans Training, Evaluation, and Production.]
Case study 1: the Tesla data engine
youtube.com/watch?t=7714&v=Ucp0TTmvqOE
Case study 2: TFX data validation
https://mlsys.org/Conferences/2019/doc/2019/167.pdf
Case study 3: Overton (Apple)
https://machinelearning.apple.com/research/overton
A Missing Link in the ML Infra Stack?
• To turn ML into a product engineering discipline, we need an infrastructure stack that helps create a data flywheel
• What’s still missing?
  • Granular, online-offline understanding of model performance
  • Orchestrating data and models throughout the whole loop
• Maybe the Evaluation Store could help
