DevOps for DataScience

DevOps for Data Science
by Stepan Pushkarev
CTO of Hydrosphere.io

DevOps is a catchy buzzword to optimise things

© Josh Wills
http://www.slideshare.net/g33ktalk/dataengconf-sf16-bridging-the-gap-between-data-science-and-data-engineering

Is there life after marriage (read: data science)?
Dating, flowers, dreams -> Marriage -> Happily lived forever?
Collect & prepare data -> Build ML model

This talk is for people who are married (read: aware of the “other 99% of data science”)
Dating, flowers, dreams -> Marriage -> Happily lived forever?
Collect & prepare data -> Build ML model

This talk is NOT about
- Setting up Apache Spark/Hadoop cluster
- Configuring CI/CD tools like Jenkins
- Configuring monitoring tools & dashboards
- Agile/DevOps brainwashing & consulting story

Agenda
- Challenges in deploying analytics into production
- Deploying analytics as a service
- Feedback loops: testing, monitoring, analytics of analytics

Why do companies hire data scientists?

Why do companies hire data scientists?
To make products smarter.

What is the deliverable of a data scientist and a data engineer?

What is the deliverable of a data scientist?
- Academic paper?
- ML model?
- R/Python script?
- Jupyter Notebook?
- BI dashboard?

What has to be the deliverable of a data scientist?
Data pipelines and machine learning models that are deployed as
pluggable, testable, supportable, monitorable analytics services.

Option 1: Engineer to implement academic paper

Option 2: Engineer to re-implement R/Python script

Option 3: Run notebook as it is using cron

Option 4: Build software to eat the world (read: data science)

Eating data science
© Daniel Tunkelang - Where should you put your data scientists? -
www.slideshare.net/dtunkelang/where-should-you-put-your-data-scientists
Step 1 (management): Integrate data scientists into
cross-functional teams

Eating data science
Step 2 (operations): Make environments scalable
and elastic. Finally.

Eating data science
Step 3: Make data scientists write less code

Eating data science
Step 4: Deploy analytics as services

Step 5: Use feedback loops: testing, monitoring, analytics for analytics
Collect & prepare data -> Build ML model -> Test -> Deploy as a service -> Monitor, maintain, analyze -> (back to Collect & prepare data)

Deploying analytics as a service
- Defines deliverable for Data Scientist / Data Engineer.
- Plugs analytics into end-to-end products through API.
- With the right tooling, lets data scientists deploy it in a self-service manner.
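
As an illustration of "analytics as a service", here is a minimal sketch using only the Python standard library. It is not Hydrosphere Mist's API; the `score` function, its weights, the JSON shape and the port are all hypothetical placeholders for a real model and contract.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def score(features):
    """Hypothetical stand-in for a trained model: a fixed linear scorer."""
    weights = {"clicks": 0.5, "visits": 0.25}
    return sum(weights.get(name, 0.0) * value for name, value in features.items())


def handle_request(body):
    """Turn a JSON request body into (HTTP status, JSON response body)."""
    try:
        features = json.loads(body)["features"]
        return 200, json.dumps({"score": score(features)}).encode()
    except (ValueError, KeyError, TypeError, AttributeError) as err:
        # Malformed JSON or an unexpected payload shape is a client error.
        return 400, json.dumps({"error": str(err)}).encode()


class ScoringHandler(BaseHTTPRequestHandler):
    """Thin HTTP wrapper; all logic lives in handle_request so it is testable."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_request(self.rfile.read(length))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8080):
    """Start the service (blocks); the port is an arbitrary example value."""
    HTTPServer(("127.0.0.1", port), ScoringHandler).serve_forever()
```

Keeping the request handling in a plain function means the same contract can be exercised in tests without opening a socket, which fits the "pluggable, testable" deliverable described earlier.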

Look around - proprietary ML-based APIs
- Alchemy API
- Google Prediction API
- Cloud Vision API
- Azure ML
Can we build our own on top of Apache Spark?

Bad Practice #1. Business logic in Spark? WTF?

Bad Practice #2. Database as API
- Set a flag to build a report
- Poll for new tasks
- Execute reporting job
- Mark job as complete & save result
- Poll for result

Bad Practice #3. Low level HTTP API
When Data Scientists
design an API...

Hydrosphere Mist - a service for exposing analytics
jobs and machine learning models as web services

Types of analytics services
- Enterprise Analytics services
- Reactive or Streaming services
- Realtime ML services

Enterprise analytics services
- Cannot be pre-calculated
- On-demand parametrized jobs
- Require large-scale processing
Examples:
- Reporting
- Simulation (pricing, bank stress testing, taxi rides)
- Forecasting (ad campaign, energy savings, others)
- Ad-hoc analytics tools for business users

Reactive or Streaming services

Reactive or Streaming Reporting services

Realtime Machine Learning Services
Train models in Apache Spark and deploy them for realtime,
low-latency serving/scoring with high throughput

PMML is not an option
Spark ML, TensorFlow, H2O, Vowpal Wabbit, and every new ML
library invents (read: uses) its own serialisation format

Format is not an issue if we re-define the deliverable for an ML model
Deliverable artifact for Machine Learning Model:
- Model in any format: xml, json, parquet, pojo, other
- Single row serving / scoring layer
- Large scale, batch processing engine
- Monitoring, testing integration

Zooming out
Repository: MLLib model, TensorFlow model, other model
Unified Serving/Scoring API
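
One way to picture a unified serving/scoring API is a common interface that hides each library's own model format. The sketch below is an assumption-laden illustration: `ServableModel`, `LinearModel`, `RuleModel` and `ModelRepository` are hypothetical names, and the two toy models merely stand in for models exported from different libraries.

```python
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List

Row = Dict[str, float]


class ServableModel(ABC):
    """Hypothetical common contract that every deployed model fulfils."""

    @abstractmethod
    def score_row(self, row: Row) -> float:
        """Low-latency scoring of a single row."""

    def score_batch(self, rows: Iterable[Row]) -> List[float]:
        """Default batch scoring; an engine-backed model could override it."""
        return [self.score_row(row) for row in rows]


class LinearModel(ServableModel):
    """Stand-in for a model exported from one library."""

    def __init__(self, weights: Row, bias: float = 0.0) -> None:
        self.weights = weights
        self.bias = bias

    def score_row(self, row: Row) -> float:
        return self.bias + sum(w * row.get(k, 0.0) for k, w in self.weights.items())


class RuleModel(ServableModel):
    """Stand-in for a model that arrives in a different format."""

    def __init__(self, field: str, threshold: float) -> None:
        self.field = field
        self.threshold = threshold

    def score_row(self, row: Row) -> float:
        return 1.0 if row.get(self.field, 0.0) >= self.threshold else 0.0


class ModelRepository:
    """Callers address models by name and never see the underlying format."""

    def __init__(self) -> None:
        self._models: Dict[str, ServableModel] = {}

    def register(self, name: str, model: ServableModel) -> None:
        self._models[name] = model

    def score(self, name: str, row: Row) -> float:
        return self._models[name].score_row(row)
```

With this shape, the serialisation format becomes an internal detail of each adapter, and the single-row and batch paths share one contract.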

Agenda
- Challenges in deploying analytics into production
- Deploying analytics as a service
- Feedback loops: testing, monitoring, analytics of analytics

Testing, monitoring, analytics of analytics
- Poorly discussed in the community.
- We are in production, baby!
- Regression.
- State matters. Model lifetime is limited.
- Data drifts; pipelines and models fail silently.
● Saves time
● Saves money
● Saves lives

TDD world does not work here
Pff… easy:
- Unit tests - by platform developers
- Integration tests - often impossible
Not clear who and not clear how:
- Regression
- Data Validation
- Production testing
- Data and ML pipelines quality monitoring
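
A data validation check of the kind listed above could look like the following minimal sketch. The schema representation (field name mapped to a predicate), the function names and the 1% default threshold are assumptions chosen for illustration, not something the talk prescribes.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical schema: field name -> predicate the value must satisfy.
Schema = Dict[str, Callable[[Any], bool]]


def is_valid(record: Dict[str, Any], schema: Schema) -> bool:
    """A record is valid when every schema field is present and passes its check."""
    return all(field in record and check(record[field]) for field, check in schema.items())


def validate_batch(
    records: List[Dict[str, Any]], schema: Schema, max_invalid_ratio: float = 0.01
) -> Tuple[bool, float]:
    """Return (batch accepted?, observed invalid ratio) for a batch of records."""
    if not records:
        return True, 0.0
    invalid = sum(1 for record in records if not is_valid(record, schema))
    ratio = invalid / len(records)
    return ratio <= max_invalid_ratio, ratio
```

Returning the ratio alongside the verdict lets the same check feed a monitoring metric instead of only failing a job.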

Need either “Data QA” & “Data Ops” people
or … AI
(formula for the next 10 000 startups - take something and add AI)

Smart data structures and dumb code works a lot
better than the other way around

Can we develop a DSL and data structures that are
smart enough to learn from data patterns, trends and
anomalies, so that they QA themselves?

QA view: a universe vs. big data analytics system
People observe and monitor signals from stars to
check that the universe is not broken today

And Marijuana to make sense out of it

Metrics processing, monitoring, correlation insights...
...Isn’t it a big data analytics task on its own?

Solution: loop of analytics for analytics
ML pipeline -> Emit metrics -> Kafka -> Stream it back into Spark Context -> Analytics jobs for metrics -> Use insights to make our data structures smart
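
The loop can be sketched in a few lines. This is a simplified illustration only: an in-memory queue stands in for Kafka, a plain function stands in for a Spark analytics job, and the names `MetricStream` and `find_anomalies`, the window size and the three-sigma rule are all assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class MetricStream:
    """In-memory stand-in for a Kafka topic carrying (metric name, value) events."""

    def __init__(self):
        self._events = deque()

    def emit(self, name, value):
        self._events.append((name, value))

    def drain(self):
        while self._events:
            yield self._events.popleft()


def find_anomalies(stream, window=5, sigmas=3.0):
    """Flag values that deviate strongly from the recent history of their metric."""
    history = {}
    anomalies = []
    for name, value in stream.drain():
        past = history.setdefault(name, deque(maxlen=window))
        if len(past) == window:
            baseline, spread = mean(past), pstdev(past)
            if abs(value - baseline) > sigmas * max(spread, 1e-9):
                anomalies.append((name, value))
                continue  # keep outliers out of the baseline
        past.append(value)
    return anomalies
```

In a real deployment the pipeline would publish metrics to a topic and a streaming job would consume them; the point here is only that monitoring becomes an analytics job like any other.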

Benefits
● Don’t need to talk to Ops! :)
● Already have Apache Spark and Kafka in place
● Data Scientist in the loop!
● Unlimited flexibility in analytics, correlation and using ML for ML
● Models could be fed back into smart, self-QA-ed data structures.

Hydrosphere Swirl - a system that creates a swirl of
analytics for analytics

Hydrosphere Swirl: Vision (diagram labels)
- Original ML pipeline
- (1) Emit metrics
- Kafka
- Streaming or batch Swirl jobs
- Hydrosphere Swirl
- Plug, modify, deploy, run jobs & consume results
- Metrics definition, Notebook integration
- Hydrosphere Mist

Swirl Demo: Serve Ads to users with positive Tweets
Twitter -> Prepare data -> Classify by sentiment -> Serve ads to user
Hydrosphere Swirl metrics: Invalid records 10/sec, Clicks 2k/sec, Ratio 0.8

Twitter -> Ingest & transform -> Classify by sentiment -> Serve ads to user
Hydrosphere Swirl metrics: Invalid records 20k/sec, Clicks 10/sec, Ratio 0.2
Data pipeline is broken

Twitter -> Ingest & transform -> Serve ads to user
Hydrosphere Swirl metrics: Invalid records 10/sec, Clicks 10/sec, Ratio 0.2
New ML model deployment: deployed bug in ML code
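
The three demo slides suggest a simple diagnosis rule, sketched below. It reads the flattened slide numbers as invalid records per second and clicks per second, which is an interpretation of the extracted text; the `Snapshot` type, the `diagnose` function and the factor of 10 are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    """Pipeline metrics at one point in time (rates are per second)."""
    invalid_per_sec: float
    clicks_per_sec: float


# Healthy values as read from the first demo slide (an interpretation).
BASELINE = Snapshot(invalid_per_sec=10, clicks_per_sec=2000)


def diagnose(current, baseline=BASELINE, factor=10.0):
    """Separate a data problem from a model problem using two metrics."""
    if current.invalid_per_sec > baseline.invalid_per_sec * factor:
        # A flood of invalid records points at ingestion or transformation.
        return "data pipeline is broken"
    if current.clicks_per_sec * factor < baseline.clicks_per_sec:
        # Input looks clean but the outcome collapsed: suspect the model.
        return "suspect ML model deployment"
    return "healthy"
```

The order of the checks matters: a broken data pipeline also depresses clicks, so the invalid-record rate is examined first.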

Thank you
Looking for
- Feedback
- Advisors, mentors & partners
- Pilots and early adopters
Stay in touch
- @hydrospheredata
- https://github.com/Hydrospheredata
- http://hydrosphere.io/
- spushkarev@hydrosphere.io
