© Copyright 2016 Pivotal. All rights reserved.
Esther Vasiete
Pivotal Data Scientist
Structure Data 2016
Data Science at Scale on MPP Databases – Use Cases & Open Source Tools
Joint work with Pivotal Data Science
Agenda
•  Introduction
•  Open Source Data Science Toolkit
•  Real-world applications
   –  Predictive maintenance of automobiles
   –  Predicting insurance claims
   –  Predicting customer churn
•  Data science deep-dive with Jupyter notebooks
   –  Text analytics on MPP (github.com/vatsan)
   –  Image processing on MPP (github.com/gautamsm)
Pivotal Data Science
Our Charter:
Pivotal Data Science is Pivotal’s differentiated and highly opinionated data-centric service delivery organization (part of Pivotal Labs).
Our Goals:
•  Expedite customer time-to-value and ROI by driving business-aligned innovation and solutions assurance within Pivotal’s Data Fabric technologies.
•  Drive customer adoption and autonomy across the full spectrum of Pivotal Data technologies through best-in-class data science and data engineering services, with a deep emphasis on knowledge transfer.
[Figure: Data Science, Data Engineering, App Dev]
Pivotal Data Science Knowledge Development
Use Case: Preventive Maintenance for Connected Vehicles
•  Customer vehicles transmit Diagnostic Trouble Codes (DTCs) and vehicle status data to the Pivotal analytics environment
•  Can the DTC data be leveraged to predict the presence of potential problems in vehicles?
•  Set up a data science framework on the Pivotal analytics environment that enables the customer's data science team to continuously monitor problems in their vehicles using DTC data
Problem Setup – Predicting Job Type from Diagnostic Trouble Codes (DTCs)
[Figure: sequences of DTC observations along a Time axis (code categories B, P, C, U, singly or in combination, e.g. "DTC: B", "DTC: B, P, C", "DTC: U"), each sequence followed by a repair job labeled with its Job Type (Transmission; Transmission and Engine; Body). Callout on each sequence: "Can the DTCs observed here predict this Job Type?"]
Data Parallelism
One or more jobs can occur on the same day, so this is a multi-label problem
One-vs-rest classifiers are built in parallel
[Figure: One-vs-Rest Classification. Binary (0/1) label indicators for Class 1, Class 2 and Class 3; Red vs. Non-Red is trained on Segment 1, Green vs. Non-Green on Segment 2, Blue vs. Non-Blue on Segment N]
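The one-vs-rest setup can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the training records, the tiny gradient-descent logistic regression, and the thread pool (standing in for "one classifier per segment") are all hypothetical and are not the models or infrastructure used in the project.

```python
from concurrent.futures import ThreadPoolExecutor
import math

DTC_CODES = ["B", "C", "P", "U"]  # DTC categories used as binary features


def featurize(dtcs):
    """Binary indicator vector: 1.0 if the DTC category was observed."""
    return [1.0 if code in dtcs else 0.0 for code in DTC_CODES]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_binary(task):
    """Fit a tiny logistic regression (stochastic gradient descent) for one job type."""
    job_type, X, y = task
    w, b = [0.0] * len(DTC_CODES), 0.0
    for _ in range(500):
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            w = [wj - 0.1 * err * xj for wj, xj in zip(w, xi)]
            b -= 0.1 * err
    return job_type, (w, b)


def train_one_vs_rest(records):
    """records: list of (set of DTC categories, set of job types)."""
    X = [featurize(dtcs) for dtcs, _ in records]
    job_types = sorted({jt for _, jobs in records for jt in jobs})
    # One binary target per job type: 1 if that job occurred, else 0 ("rest").
    tasks = [(jt, X, [1.0 if jt in jobs else 0.0 for _, jobs in records])
             for jt in job_types]
    with ThreadPoolExecutor() as pool:  # stand-in for one model per segment
        return dict(pool.map(train_binary, tasks))


def predict_proba(models, dtcs):
    x = featurize(dtcs)
    return {jt: sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
            for jt, (w, b) in models.items()}


# Hypothetical training data: DTCs observed before a visit, jobs done at the visit.
records = [
    ({"B"}, {"Body"}),
    ({"B", "P"}, {"Body", "Engine"}),
    ({"P"}, {"Engine"}),
    ({"U"}, {"Transmission"}),
    ({"P", "U"}, {"Engine", "Transmission"}),
    ({"C"}, set()),
]
models = train_one_vs_rest(records)
probs = predict_proba(models, {"P", "U"})
```

Because each binary model depends only on its own target column, the models can be fitted independently, which is what makes the problem a natural fit for parallel execution.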
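Model Scoring Pipeline
[Figure: scoring pipeline from Ingest to Sink. Incoming DTC observations (DTC: B; DTC: B, P, C; DTC: U) are scored in real time against cached models (model caching in GPDB/HAWQ), one per job type (Body, Axle, Engine). Each model applies the test Prob >= Threshold, and the results feed a web or mobile app dashboard]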
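The "Prob >= Threshold" step of the pipeline can be illustrated as follows. The probabilities and per-class thresholds below are made-up values for illustration; the slide does not state how thresholds were chosen.

```python
def score(probabilities, thresholds, default_threshold=0.5):
    """Return the job types whose predicted probability meets its threshold."""
    return sorted(
        job_type
        for job_type, prob in probabilities.items()
        if prob >= thresholds.get(job_type, default_threshold)
    )


# Hypothetical per-class probabilities for one incoming DTC observation.
probabilities = {"Body": 0.82, "Axle": 0.10, "Engine": 0.55}
# Hypothetical per-class thresholds (for example, tuned on validation data).
thresholds = {"Body": 0.7, "Axle": 0.5, "Engine": 0.6}

flagged = score(probabilities, thresholds)
print(flagged)
```

Keeping one threshold per job type lets each one-vs-rest model trade precision against recall independently.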
MPP Architectural Overview
Think of it as multiple PostgreSQL servers
[Figure: Master and Segments/Workers]
Rows are distributed across segments by a particular field (or randomly)
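A rough sketch of the two distribution policies, assuming a CRC32 hash purely for illustration (the database's actual hash function and segment mapping are not described on the slide, and `vehicle_id` is a hypothetical column name):

```python
import random
import zlib
from collections import defaultdict


def segment_for(key, n_segments):
    """Map a distribution-key value to a segment using a stable hash."""
    return zlib.crc32(str(key).encode("utf-8")) % n_segments


def distribute(rows, n_segments, key=None, seed=0):
    """Assign rows to segments by hashing `key`, or randomly if key is None."""
    rng = random.Random(seed)
    segments = defaultdict(list)
    for row in rows:
        if key is None:
            seg = rng.randrange(n_segments)
        else:
            seg = segment_for(row[key], n_segments)
        segments[seg].append(row)
    return dict(segments)


rows = [{"vehicle_id": v, "dtc": d}
        for v, d in [(101, "B"), (102, "P"), (101, "U"), (103, "C"), (102, "B")]]

by_key = distribute(rows, n_segments=4, key="vehicle_id")
at_random = distribute(rows, n_segments=4)
```

Distributing by a field keeps all rows with the same key value on one segment, so per-key computations (for example, per vehicle) need no data movement; random distribution gives even load but no such co-location.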
IT TAKES MORE THAN ONE TOOL
Open Source Data Science Toolkit
[Figure: toolkit overview with panels KEY LANGUAGES, KEY TOOLS (Modeling Tools, Visualization Tools) and PLATFORM. Recoverable labels: MLlib, PL/X, Pivotal Big Data Suite, GemFire; other logos are not recoverable]
Scalable, In-Database Machine Learning
•  Open Source https://github.com/madlib/madlib
•  Works on Greenplum DB, Apache HAWQ and PostgreSQL
•  In active development by Pivotal
•  MADlib is now an Apache Software Foundation incubator project!
Apache (incubating)
Functions
Supervised Learning
Regression Models
•  Cox Proportional Hazards Regression
•  Elastic Net Regularization
•  Generalized Linear Models
•  Linear Regression
•  Logistic Regression
•  Marginal Effects
•  Multinomial Regression
•  Ordinal Regression
•  Robust Variance, Clustered Variance
•  Support Vector Machines
Tree Methods
•  Decision Tree
•  Random Forest
Other Methods
•  Conditional Random Field
•  Naïve Bayes
Unsupervised Learning
•  Association Rules (Apriori)
•  Clustering (K-means)
•  Topic Modeling (LDA)
Statistics
Descriptive
•  Cardinality Estimators
•  Correlation
•  Summary
Inferential
•  Hypothesis Tests
Other Statistics
•  Probability Functions
Other Modules
•  Conjugate Gradient
•  Linear Solvers
•  PMML Export
•  Random Sampling
•  Term Frequency for Text
Time Series
•  ARIMA
Aug 2015
Data Types and Transformations
•  Array Operations
•  Dimensionality Reduction (PCA)
•  Encoding Categorical Variables
•  Matrix Operations
•  Matrix Factorization (SVD, Low Rank)
•  Norms and Distance Functions
•  Sparse Vectors
Model Evaluation
•  Cross Validation
Predictive Analytics Library
@MADlib_analytic
Use Case: Predicting insurance claim amounts using structured and unstructured data
•  Using features from structured and unstructured data sources associated with claims, build the capability to predict claim amounts
Text analytics on MPP
•  Unstructured data in the form of claim comments and claim descriptions (text)
•  Use a bag-of-words approach (unigrams, bigrams)
•  tf-idf for more meaningful insights
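A minimal sketch of the bag-of-words and tf-idf steps in plain Python. The claim texts are invented, and the weighting used here (term frequency times log(N / document frequency)) is one common tf-idf variant, not necessarily the one used in the walkthrough notebook.

```python
import math
import re
from collections import Counter


def ngrams(text, n_max=2):
    """Lowercase, tokenize, and return all unigrams up to n_max-grams."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    grams = []
    for n in range(1, n_max + 1):
        grams += [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return grams


def tfidf(docs, n_max=2):
    """Return one {term: weight} dict per document."""
    bags = [Counter(ngrams(d, n_max)) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for bag in bags for term in bag)
    weights = []
    for bag in bags:
        total = sum(bag.values())
        weights.append({t: (c / total) * math.log(n_docs / df[t])
                        for t, c in bag.items()})
    return weights


# Hypothetical claim descriptions.
docs = [
    "water damage to kitchen floor",
    "water damage to basement",
    "rear bumper damage after collision",
]
weights = tfidf(docs)
```

A term such as "damage" that occurs in every document receives a weight of zero, while terms specific to one claim ("kitchen") are weighted highest, which is why tf-idf tends to surface more informative features than raw counts.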
Code walkthrough: Text analytics on MPP
github.com/vatsan/text_analytics_on_mpp/tree/master/vector_space_models
We’ll walk through this Jupyter notebook
Use Case: Churn prediction
•  Build a churn model to predict which customers are most likely to churn
•  Provide insights into the key factors responsible for churn, so that it is possible to intervene before a customer churns
Usage Time Series Data
•  Aggregate weekly usage by user
•  Compute descriptive statistics
•  Extract features based on business expertise
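The three steps above can be sketched with the standard library alone. The event tuples are invented, and `last_minus_first` is a hypothetical example of a business-driven feature (a crude usage trend), not one named in the talk.

```python
from collections import defaultdict
from datetime import date
from statistics import mean, pstdev


def weekly_usage(events):
    """events: iterable of (user_id, date, usage).
    Returns {user_id: {(iso_year, iso_week): total usage}}."""
    totals = defaultdict(lambda: defaultdict(float))
    for user_id, day, usage in events:
        iso_year, iso_week, _ = day.isocalendar()
        totals[user_id][(iso_year, iso_week)] += usage
    return totals


def describe(weekly):
    """Per-user descriptive statistics over the weekly totals."""
    features = {}
    for user_id, weeks in weekly.items():
        values = [weeks[w] for w in sorted(weeks)]  # chronological order
        features[user_id] = {
            "n_weeks": len(values),
            "mean": mean(values),
            "std": pstdev(values),
            "min": min(values),
            "max": max(values),
            "last_minus_first": values[-1] - values[0],  # hypothetical trend feature
        }
    return features


# Hypothetical usage events.
events = [
    ("u1", date(2016, 1, 4), 10.0),   # Monday of ISO week 1
    ("u1", date(2016, 1, 6), 5.0),    # same ISO week
    ("u1", date(2016, 1, 12), 3.0),   # ISO week 2
    ("u2", date(2016, 1, 5), 7.0),
]
features = describe(weekly_usage(events))
```

In the database, the same aggregation would typically be a GROUP BY on user and week, with each user's rows processed on the segment that holds them.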
Open Source Analytics Ecosystem
Companies benefit from algorithmic breadth and scalability for building and socializing data science models
[Figure: Algorithms and Visualization tool logos. Recoverable labels: MLlib, PL/X]
Best-of-breed in-memory and in-database tools for an MPP platform
Data Parallelism through PL/X: X in Python, R, Java, C/C++ and pgSQL
•  For embarrassingly parallel tasks, we can use procedural languages to easily parallelize any stand-alone library in Java, Python, R, pgSQL or C/C++
•  The interpreter/VM of the language ‘X’ is installed on each node of the MPP environment
•  plpython and python are loaded as dynamic libraries on the master and segment nodes (libpython.so and plpython.so are under $GPHOME/ext/python)
[Figure: SQL arrives at the Master Host (with a Standby Master), which connects through the Interconnect to four Segment Hosts, each running two Segments]
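The idea of running the same stand-alone function on every segment's local rows can be mimicked outside the database. In this sketch a thread pool plays the role of the segments and `word_count` is a hypothetical stand-in for any stand-alone library call; none of this is Greenplum or HAWQ API.

```python
from concurrent.futures import ThreadPoolExecutor


def run_on_segments(segments, udf):
    """Apply `udf` to every row, one worker per segment (a stand-in for PL/X)."""
    def run_one(rows):
        # Each worker only touches the rows stored on its own segment.
        return [udf(row) for row in rows]

    with ThreadPoolExecutor(max_workers=len(segments)) as pool:
        per_segment = list(pool.map(run_one, segments))
    # The master gathers the per-segment results.
    return [out for seg_out in per_segment for out in seg_out]


def word_count(text):
    """Hypothetical stand-alone function to be parallelized."""
    return len(text.split())


# Hypothetical rows, already distributed across three segments.
segments = [
    ["engine light on", "brake noise"],
    ["door lock fault"],
    ["no start", "battery drained overnight"],
]
results = run_on_segments(segments, word_count)
```

The task is embarrassingly parallel because each row is processed independently, so no worker needs data from another segment.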
User Defined Functions (UDFs) in PL/Python
•  Procedural languages need to be installed on each database used.
•  The syntax is like a normal Python function with the function definition line replaced by a SQL wrapper. Alternatively, think of it as a SQL user-defined function with Python inside.
CREATE FUNCTION seasonality (x float[])    -- SQL wrapper
  RETURNS float[]
AS $$
  # Normal Python
  import statsmodels.api as sm
  s = sm.tsa.seasonal_decompose(x).seasonal
  return s
$$ LANGUAGE plpythonu;                     -- SQL wrapper
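Since the body between the `$$` delimiters is ordinary Python, the logic can be prototyped as a plain function before it is wrapped in `CREATE FUNCTION`. The sketch below is a simplified, dependency-free stand-in for the seasonal component: it averages each position in the cycle and centers the result, whereas statsmodels' `seasonal_decompose` also removes a moving-average trend first. The explicit `period` argument is an assumption added for this sketch.

```python
def seasonality(x, period):
    """Simplified seasonal component of a series.

    Averages the de-meaned values at each position of the cycle and centers
    the pattern so it sums to zero over one period. This is NOT the
    statsmodels algorithm; it skips the trend-removal step.
    """
    if period < 1 or len(x) < period:
        raise ValueError("need at least one full period of data")
    overall = sum(x) / len(x)
    by_position = [[] for _ in range(period)]
    for i, value in enumerate(x):
        by_position[i % period].append(value - overall)
    pattern = [sum(vals) / len(vals) for vals in by_position]
    offset = sum(pattern) / period
    return [pattern[i % period] - offset for i in range(len(x))]


# Hypothetical usage series with a cycle of length 2.
usage = [1.0, 3.0, 1.0, 3.0, 1.0, 3.0]
seasonal = seasonality(usage, period=2)
print(seasonal)
```

Once the function behaves as expected locally, its body can be pasted between the `$$` delimiters, with the parameters and return type declared in the SQL wrapper.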
Usage Time Series Data with PL/X
•  Easily harness open source libraries (for machine learning, signal processing...) inside your UDFs
•  Runs at scale through data parallelism
Code walkthrough: Image processing on MPP
github.com/gautamsm/data-science-on-mpp/tree/master/image_processing
In-database Canny edge detection with OpenCV inside a PL/C function
Pivotal Data Science Blogs
1.  Scaling native (C++) apps on Pivotal MPP
2.  Predicting commodity futures through Tweets
3.  A pipeline for distributed topic & sentiment analysis of tweets on Greenplum
4.  Using data science to predict TV viewer behavior
5.  Twitter NLP: Scaling part-of-speech tagging
6.  Distributed deep learning on MPP and Hadoop
7.  Multi-variate time series forecasting
8.  Pivotal for good – Crisis Textline
http://blog.pivotal.io/data-science-pivotal
Thank You!
A NEW PLATFORM FOR A NEW ERA
