Data Science Salon: Quit Wasting Time – Case Studies in Production Machine Learning

©2018 Dataiku, Inc. | www.dataiku.com | contact@dataiku.com | @dataiku
DATA SCIENCE SALON
MIAMI 2018
Stop Wasting Time – Case Studies in
Production Machine Learning
Yashas Vaidya
Technical Lead, Alliances-US
yashas.vaidya@dataiku.com

2 X 2
• When putting things into production ...
• why do Data Scientists take too long?
• why does DS take too long?
• Two case studies getting around those issues
• Speeding up ETL steps to create insights
• Case: Starting with a template and making reusable analytics
Two questions, two case studies

Data Science in production
• Business lines are often the sponsors and need to see ROI from projects
• Creates value across the organization and provides DS teams with recognition
• From strategic insights to augmented (or automated) decision making
Why the focus?

Wait a minute …
Where in the world is Ken?

Re-introduction
• Evangelizer in the Alliances team,
• Training and empowering consulting and technical partners
• ABD ...
I’m Yashas and I am ...

Data Scientists take too long
They mostly spend their time doing ETL
Source: 2017 Data Scientist Report
CrowdFlower: https://visit.crowdflower.com/WC-2017-Data-Science-Report_LP.html

Why do Data Scientists take so long
• H1: Data scientists have a hard time getting their hands on good quality data
and thus must scrub away
• H2: Given their background, they are actually pretty good at cleaning and
organizing data. So they focus on what they can do well and control.
Several hypotheses

Some evidence for both
Data scientists are usually happy, but overwhelmed by data
Source: 2017 Data Scientist Report
CrowdFlower: https://visit.crowdflower.com/WC-2017-Data-Science-Report_LP.html

Getting to production is hard to do
• different teams in charge
• lack of established platforms or frameworks
• overall low maturity on deployment skills and experience
McKinsey Institute: « less than 10% of data science projects are deployed into
production […] with average deployment times of 9 to 12 months »
Why is data science hard to productionalize?

First case study today
Regulatory Compliance

A bit of context
• New European credit reporting regulation – AnaCredit
• Every financial institution must report a monthly
dataset of loans over 25k with up to 95 attributes
• instrument data
• counterparty data
• liability data…
• Progressive rollout in 2018 for 8 countries
• Large Banks and Insurance companies have
provisioned around 10-15M € for the project
→ How can we provide an innovative, differentiating solution to this problem?
AnaCredit regulatory reporting

Building the AnaCredit solution
• Key Usage Scenarios
• Build and automate financial reporting in the same format across 8 countries
• Take into account corrections and quality tests, as well as rejections
• Provide broader reporting on data quality and insights on portfolio management
• A best-of-breed approach to the project
• Data Management & Predictive – Dataiku DSS
• Regulatory watch and Risk Compliance – PwC
• Self Service Analytics, Insights and Reporting – Tableau Software
Usage Scenarios and Approach

Data Management
Focus on Agility and Repeatability
Data Processing Flow for
Template 1
Working « prototype » of the
dataset and templates within 2
months
• 2 Data Engineers
• 1 Business Analyst
• 1 Risk Analyst
• Agile, documented, re-usable,
runs « at scale » in the
company’s IS
• Articulated with ETL and DI
tools

Data Management
Instrument Data

Data Management
Counterparty Data

Data Management
Processing and Delivering XML Templates

Analytics and Reporting
• Data Quality and Compliance
• Evolutions over time of errors and rejects
• Risk portfolio reporting
• Predictive insights
End user dashboards, improve analytics culture

Machine Learning
Fast exploration of multiple use cases to build
business cases
• Automatic enrichment of data through fuzzy
matching of LEI records
• Predicting outcome for late payments across
loan lifecycle
• Anomaly detection on loans (Isolation Forest)
The AnaCredit dataset is a great playground for ML
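
As an illustration of the fuzzy-matching idea on this slide, here is a minimal sketch using only Python's standard library. It is not the method used in the project: the similarity measure (difflib's SequenceMatcher), the 0.85 threshold, the function names, and the placeholder LEI codes are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def best_lei_match(counterparty: str, lei_records: dict, threshold: float = 0.85):
    """Return (lei, score) for the closest legal name, or None below threshold.

    `lei_records` maps an LEI code to a legal entity name. The threshold is an
    arbitrary illustrative value, not a recommended setting.
    """
    target = normalize(counterparty)
    best = None
    for lei, legal_name in lei_records.items():
        score = SequenceMatcher(None, target, normalize(legal_name)).ratio()
        if best is None or score > best[1]:
            best = (lei, score)
    if best is not None and best[1] >= threshold:
        return best
    return None


# Hypothetical reference data: these LEI codes are made-up placeholders.
LEI_RECORDS = {
    "LEI-PLACEHOLDER-0001": "Acme Holdings S.A.",
    "LEI-PLACEHOLDER-0002": "Globex Bank AG",
}

if __name__ == "__main__":
    print(best_lei_match("ACME Holdings SA", LEI_RECORDS))
    print(best_lei_match("Unrelated Corp", LEI_RECORDS))
```

A matched LEI could then be attached to the counterparty record as an enrichment column; unmatched names would be left for manual review.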

Real-time recommendation system at scale
• Power a sales platform used by 4,000 different clients
• First project
• Provide customized recommendations (3 models per client)
• Recommendations should be available in real time (API endpoint)
A production conundrum

How is Dataiku used in production?
What does it mean to a customer of a customer?
Data → DSS → Prediction → Customer

What had to be done?
Design Environment
Automation Environment
API Environment
Main Proj
Proj A
Proj B
Service A
Service B
Service A
Service B
Data table A
Data table B
Dev Data table

What are the technical issues?
• We must take one Project in Design and turn it into many in Automation
• We must connect to one ingestion dataset in Design and many in Automation
• We must turn one service in Design into many different services on the API nodes
One to Many Projects
Solution: Create a bundle and upload it to Automation multiple times
Solution: Connect to different databases on the Automation node
Solution: Each new project generates its own service on the Automation node
January 2013
Dataiku founded
Paris
February 2014
DSS 1.0
The 1st tool worldwide
integrating visual data
preparation and
machine learning
January 2015
20 Employees
30 Customers
$4M Seed
April 2015
DSS 2.0
Real-time collaboration
Spark Integration
April 2015
Office in New York
October 2016
DSS 3.0
300% Yearly Growth
$14M Series A
February 2017
DSS 4.0
Office in London
August 2017
$28M Series B
100+ Employees
100+ Customers
A brief history of
Dataiku and DSS

Consumer Goods
Consumer Electronics
Technology
Financial Services
Healthcare
Media
Consulting
Transportation
Travel
E-Retail
125+ Customers
3X YoY Growth in 2017
150% Yearly Net Retention
Powering over 125 customers across 20 countries and multiple industries
Customers - Leaders in their industries
#1 Insurance Brand
#1 Pharma Brand
#1 Financial Information Company
#1 Flash Sales Company
#1 Car Sharing Company
#1 Cosmetics Company
#3 CPG Company

Horizontal Collaboration vs. Vertical Collaboration
[Diagram: two team layouts built from the same roles: Data Engineer, Data Analyst, Data Scientist, Business Leader, Line-of-business, Data Consumer]
