©2018 Dataiku, Inc. | www.dataiku.com | contact@dataiku.com | @dataiku
DATA SCIENCE SALON
MIAMI 2018
Stop Wasting Time – Case Studies in
Production Machine Learning
Yashas Vaidya
Technical Lead, Alliances-US
yashas.vaidya@dataiku.com
2 X 2
• When putting things into production ...
• why do Data Scientists take too long?
• why does DS take too long?
• Two case studies getting around those issues
• Speeding up ETL steps to create insights
• Case: Starting with a template and making reusable analytics
Two questions, two case studies
Data Science in production
• Business lines are often the sponsors and need to see ROI from projects
• Creates value across the organization and provides DS teams with recognition
• From strategic insights to augmented (or automated) decision making
Why the focus?
Wait a minute …
Where in the world is Ken?
Re-introduction
• Evangelist on the Alliances team
• Training and empowering consulting and technical partners
• ABD ...
I’m Yashas and I am ...
Data Scientists take too long
They mostly spend their time doing ETL
Source: 2017 Data Scientist Report
CrowdFlower: https://visit.crowdflower.com/WC-2017-Data-Science-Report_LP.html
Why do Data Scientists take so long?
• H1: Data scientists have a hard time getting their hands on good quality data
and thus must scrub away
• H2: Given their background, they are actually pretty good at cleaning and
organizing data. So they focus on what they can do well and control.
Several hypotheses
Some evidence for both
Data scientists are usually happy data scientists, overwhelmed by data
Source: 2017 Data Scientist Report
CrowdFlower: https://visit.crowdflower.com/WC-2017-Data-Science-Report_LP.html
Getting to production is hard to do
• different teams in charge
• lack of established platforms or frameworks
• overall low maturity on deployment skills and experience
McKinsey Institute: « less than 10% of data science projects are deployed into
production […] with average deployment times of 9 to 12 months »
Why is data science hard to productionalize?
Case Study 1
First case study today
Regulatory Compliance
A bit of context
• New European credit reporting regulation – AnaCredit
• Every financial institution must report a monthly
dataset of loans over 25k with up to 95 attributes
• instrument data
• counterparty data
• liability data…
• Progressive rollout in 2018 for 8 countries
• Large Banks and Insurance companies have
provisioned around 10-15M € for the project
→ How can we provide an innovative, differentiating solution to this problem?
AnaCredit regulatory reporting
Building the AnaCredit solution
• Key Usage Scenarios
• Build and automate financial reporting in the same format across 8 countries
• Take into account corrections and quality tests, as well as rejections
• Provide broader reporting on data quality and insights on portfolio management
• A best-of-breed approach to the project
• Data Management & Predictive – Dataiku DSS
• Regulatory watch and Risk Compliance – PwC
• Self Service Analytics, Insights and Reporting – Tableau Software
Usage Scenarios and Approach
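The quality tests and rejections in the usage scenarios can be pictured with a minimal sketch. It uses the 25k reporting threshold from the context slide, but the field names (`loan_id`, `amount_eur`, `counterparty_id`) and the single "missing field" test are assumptions for illustration, not the AnaCredit specification or the checks used in the project.

```python
from typing import Dict, List, Tuple

REPORTING_THRESHOLD = 25_000  # loans over 25k are reportable (from the slides)
# Hypothetical required attributes; the real dataset has up to 95
REQUIRED_FIELDS = ("loan_id", "amount_eur", "counterparty_id")


def split_report(loans: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Return (reportable, rejected) loans after a basic completeness test."""
    reportable, rejected = [], []
    for loan in loans:
        missing = [f for f in REQUIRED_FIELDS if loan.get(f) in (None, "")]
        if missing:
            # Keep the record and note why it was rejected, for later correction
            rejected.append({**loan, "reject_reason": "missing: " + ", ".join(missing)})
        elif loan["amount_eur"] > REPORTING_THRESHOLD:
            reportable.append(loan)
        # Complete loans at or below the threshold are neither reported nor rejected
    return reportable, rejected


loans = [
    {"loan_id": "L1", "amount_eur": 40_000, "counterparty_id": "C1"},
    {"loan_id": "L2", "amount_eur": 10_000, "counterparty_id": "C2"},
    {"loan_id": "L3", "amount_eur": 90_000, "counterparty_id": ""},
]
reportable, rejected = split_report(loans)
```

Keeping rejected records with a reason, rather than dropping them, is what makes reporting on errors and rejects over time possible.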
Data Management
Focus on Agility and Repeatability
Data Processing Flow for
Template 1
Working « prototype » of the
dataset and templates within 2
months
• 2 Data Engineers
• 1 Business Analyst
• 1 Risk Analyst
• Agile, documented, re-usable,
runs « at scale » in the
company’s IS
• Articulated with ETL and DI
tools
Data Management
Instrument Data
Data Management
Counterparty Data
Data Management
Open Data
Data Management
Processing and Delivering XML Templates
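As a rough illustration of delivering records as XML, the sketch below serializes loan records with Python's standard `xml.etree.ElementTree`. The element and attribute names (`Report`, `Instrument`, `Amount`, `Counterparty`, `month`) are hypothetical placeholders; the actual AnaCredit templates are not shown in the slides.

```python
import xml.etree.ElementTree as ET


def loans_to_xml(loans, reporting_month):
    """Serialize loan records into a single XML document (hypothetical layout)."""
    root = ET.Element("Report", attrib={"month": reporting_month})
    for loan in loans:
        node = ET.SubElement(root, "Instrument", attrib={"id": loan["loan_id"]})
        ET.SubElement(node, "Amount").text = str(loan["amount"])
        ET.SubElement(node, "Counterparty").text = loan["counterparty_id"]
    return ET.tostring(root, encoding="unicode")


xml_text = loans_to_xml(
    [{"loan_id": "L1", "amount": 40000, "counterparty_id": "C1"}],
    "2018-09",
)
```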
Analytics and Reporting
• Data Quality and Compliance
• Evolutions over time of errors and rejects
• Risk portfolio reporting
• Predictive insights
End user dashboards, improve analytics culture
Machine Learning
Fast exploration of multiple use cases to build
business cases
• Automatic enrichment of data through fuzzy
matching of LEI records
• Predicting outcome for late payments across
loan lifecycle
• Anomaly detection on loans (Isolation Forest)
The AnaCredit dataset is a great playground for ML
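Fuzzy matching of LEI records can be sketched with the standard library alone. The example below assumes a small in-memory registry with made-up names and placeholder identifiers (not real LEIs) and uses `difflib` similarity; it stands in for whatever matching method the project actually used, which the slides do not specify.

```python
import difflib


def normalize(name):
    """Lowercase, drop basic punctuation, and collapse whitespace."""
    return " ".join(name.lower().replace(".", "").replace(",", "").split())


def match_lei(name, lei_registry, cutoff=0.8):
    """Return the identifier of the closest registry name, or None below the cutoff."""
    by_norm = {normalize(n): lei for n, lei in lei_registry.items()}
    hits = difflib.get_close_matches(normalize(name), list(by_norm), n=1, cutoff=cutoff)
    return by_norm[hits[0]] if hits else None


# Hypothetical registry: names and identifiers are placeholders
registry = {
    "Acme Bank S.A.": "LEI-EXAMPLE-0001",
    "Globex Insurance Ltd": "LEI-EXAMPLE-0002",
}

exact_after_cleanup = match_lei("ACME Bank SA", registry)
typo_match = match_lei("Globex Insurnce Ltd.", registry)
no_match = match_lei("Unrelated Corp", registry)
```

The cutoff trades recall for precision: a lower value enriches more records but risks attaching the wrong identifier to a counterparty.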
Case Study 2
Real-time recommendation system at scale
• Power a sales platform used by 4,000 different clients
• First project
• Provide customized recommendations (3 models per client)
• Recommendations should be available in real time (API endpoint)
A production conundrum
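The scale behind this conundrum is simple arithmetic: 4,000 clients with 3 models each means 12,000 models to serve. The sketch below makes that count concrete by enumerating one endpoint path per client and model; the model kind names and the URL layout are hypothetical, chosen only for illustration.

```python
N_CLIENTS = 4_000
MODELS_PER_CLIENT = 3
TOTAL_MODELS = N_CLIENTS * MODELS_PER_CLIENT

# Hypothetical names for the three per-client models
MODEL_KINDS = ("kind_1", "kind_2", "kind_3")


def endpoint_path(client_id: int, kind: str) -> str:
    """Build a per-client, per-model path (hypothetical URL layout)."""
    if kind not in MODEL_KINDS:
        raise ValueError(f"unknown model kind: {kind}")
    return f"/clients/{client_id}/models/{kind}/predict"


all_paths = {endpoint_path(c, k) for c in range(N_CLIENTS) for k in MODEL_KINDS}
```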
How is Dataiku used in production?
What does it mean to a customer of a customer?
Data → DSS → Prediction → Customer
What had to be done?
Design Environment
Automation Environment
API Environment
Main Proj
Proj A
Proj B
Service A
Service B
Service A
Service B
Data table A
Data table B
Dev Data table
What are the technical issues?
* We must take one Project in Design
and turn it into many in Automation
* We must connect to one ingestion
dataset in Design and many in
Automation
* We must turn one service in Design
into many different services on the
API nodes
One to Many Projects
Solution: Create a bundle and
upload it to Automation
multiple times
Solution: Connect to different
databases on the Automation
node
Solution: Each new project
generates its own service on the
Automation node
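The one-to-many pattern above can be sketched as a deployment plan: one bundle from Design fans out into one project, one database connection, and one service per client. This is plain Python and does not call any Dataiku API; the naming scheme and the `Deployment` type are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Deployment:
    project_key: str    # project created on the Automation node
    db_connection: str  # client-specific ingestion database
    api_service: str    # service generated by this project


def plan_deployments(bundle_id: str, clients: List[str]) -> List[Deployment]:
    """Fan one Design-node bundle out to one deployment per client."""
    plans = []
    for client in clients:
        plans.append(
            Deployment(
                project_key=f"{bundle_id}_{client.upper()}",
                db_connection=f"db_{client.lower()}",
                api_service=f"svc_{client.lower()}",
            )
        )
    return plans


plans = plan_deployments("MAINPROJ", ["a", "b"])
```

Deriving every name from the client identifier keeps the plan reproducible, so the same bundle can be uploaded again when the Design project changes.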
January 2013
Dataiku founded
Paris
February 2014
DSS 1.0
The 1st tool worldwide
integrating visual data
preparation and
machine learning
January 2015
20 Employees
30 Customers
$4M Seed
April 2015
DSS 2.0
Real-time collaboration
Spark Integration
April 2015
Office in New York
October 2016
DSS 3.0
300% Yearly Growth
$14M Series A
February 2017
DSS 4.0
Office in London
August 2017
$28M Series B
100+ Employees
100+ Customers
A brief history of
Dataiku and DSS
Consumer Goods
Consumer Electronics
Technology
Financial Services
Healthcare
Media
Consulting
Transportation
Travel
E-Retail
125+ Customers
3X YoY Growth in 2017
150% Yearly Net Retention
Powering over 125 customers across 20 countries and multiple industries
Customers - Leaders in their industries
#1 Insurance Brand
#1 Pharma Brand
#1 Financial Information Company
#1 Flash Sales Company
#1 Car Sharing Company
#1 Cosmetics Company
#3 CPG Company
Horizontal Collaboration vs. Vertical Collaboration
[Diagram: teams of Data Engineers, Data Analysts, Data Scientists, Business Leaders, and line-of-business Data Consumers]
Thanks for listening!
