Data Science at Scale on MPP databases - Use Cases & Open Source Tools

© Copyright 2016 Pivotal. All rights reserved.
Esther Vasiete
Pivotal Data Scientist
Structure Data 2016
Joint work with Pivotal Data Science

Agenda
Ÿ  Introduction
Ÿ  Open Source Data Science Toolkit
Ÿ  Real world applications
–  Predictive maintenance of automobiles
–  Predicting insurance claims
–  Predicting customer churn
Ÿ  Data science deep-dive with Jupyter notebooks
–  Text analytics on MPP (github.com/vatsan)
–  Image processing on MPP (github.com/gautamsm)

Pivotal Data Science
Our Charter:
Pivotal Data Science is Pivotal’s differentiated and highly opinionated data-centric service delivery organization (part of Pivotal Labs).
Our Goals:
•  Expedite customer time-to-value and ROI by driving business-aligned innovation and solutions assurance within Pivotal’s Data Fabric technologies.
•  Drive customer adoption and autonomy across the full spectrum of Pivotal Data technologies through best-in-class data science and data engineering services, with a deep emphasis on knowledge transfer.
Figure: Data Science, Data Engineering, App Dev

Pivotal Data Science Knowledge Development

Use Case: Preventive Maintenance for Connected Vehicles
•  Customer vehicles transmit Diagnostic Trouble Codes (DTC) and vehicle status data to the Pivotal analytics environment
•  Can the DTC data be leveraged to predict the presence of potential problems in vehicles?
•  Set up a data science framework on the Pivotal analytics environment that would enable the customer data science team to continuously monitor problems in their vehicles using DTC data

Problem Setup – Predicting Job Type from Diagnostic Trouble Codes (DTCs)
Figure: timeline of DTC observations (codes B, P, C, U, alone or in combination) leading up to repair jobs of type Transmission, Engine, and Body. For each job: can the DTCs observed beforehand predict this Job Type?

Data Parallelism
•  One or more jobs can occur on the same day, so this is a multi-label problem
•  One-vs-rest classifiers are built in parallel
Figure: One-vs-Rest Classification. Binary (0/1) label vectors for Class 1, Class 2, and Class 3; Red vs. Non Red on Segment 1, Green vs. Non Green on Segment 2, Blue vs. Non Blue on Segment N.
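The one-vs-rest setup can be sketched as a label transformation: each job type gets its own binary target vector, and each binary problem can then be fit independently (for example, one per segment). This is a minimal illustration only; the function name and the sample visits are hypothetical and do not come from the customer project.

```python
from typing import Dict, List, Set


def one_vs_rest_labels(job_sets: List[Set[str]]) -> Dict[str, List[int]]:
    """Turn multi-label targets into one binary target vector per class."""
    classes = sorted(set().union(*job_sets))
    return {c: [1 if c in jobs else 0 for jobs in job_sets] for c in classes}


# Hypothetical repair visits; a visit may carry more than one job type.
visits = [{"Transmission"}, {"Transmission", "Engine"}, {"Body"}]
labels = one_vs_rest_labels(visits)

for job_type, target in labels.items():
    print(job_type, target)
```

Each vector in `labels` is the target for one "this class vs. the rest" classifier, which is what makes the training step embarrassingly parallel.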

Model Scoring Pipeline
Figure: incoming DTC records (B; B, P, C; U) pass from ingest to sink through per-job-type models (Body, Axle, Engine), each applying a Prob >= Threshold check. Labeled components: model caching (GPDB/HAWQ), real-time scoring, web or mobile app dashboard.
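The thresholding step of the scoring pipeline can be sketched as follows, assuming each one-vs-rest model returns a probability for its job type. The probabilities, thresholds, and function name are hypothetical placeholders, not values from the deployed system.

```python
from typing import Dict, List


def flag_job_types(probs: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Return the job types whose predicted probability meets its threshold."""
    return sorted(job for job, p in probs.items() if p >= thresholds[job])


# Hypothetical model outputs for one vehicle and hypothetical per-model thresholds.
probs = {"Body": 0.81, "Axle": 0.12, "Engine": 0.55}
thresholds = {"Body": 0.5, "Axle": 0.5, "Engine": 0.6}

flagged = flag_job_types(probs, thresholds)
print(flagged)
```

Keeping one threshold per model lets each job type trade precision against recall separately.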

MPP Architectural Overview
•  Think of it as multiple PostgreSQL servers: a master and segments (workers)
•  Rows are distributed across segments by a particular field (or randomly)
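Distribution by a field can be illustrated with a toy hash partitioner. This is a sketch of the idea only: the CRC32-modulo scheme, the function names, and the sample rows are assumptions for illustration and are not the hash function the database actually uses.

```python
import zlib
from collections import defaultdict
from typing import Dict, List


def segment_for(key: str, n_segments: int) -> int:
    """Map a distribution-key value to a segment index (toy hash, not the real one)."""
    return zlib.crc32(key.encode("utf-8")) % n_segments


def distribute(rows: List[dict], key_field: str, n_segments: int) -> Dict[int, List[dict]]:
    """Group rows by the segment that their distribution key hashes to."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(str(row[key_field]), n_segments)].append(row)
    return dict(segments)


# Hypothetical rows, distributed by vehicle_id across 4 segments.
rows = [
    {"vehicle_id": "v1", "dtc": "B"},
    {"vehicle_id": "v2", "dtc": "P"},
    {"vehicle_id": "v1", "dtc": "U"},
]
placed = distribute(rows, "vehicle_id", 4)
print(placed)
```

Rows that share a key value always land on the same segment, so per-key computations need no data movement between segments.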

IT TAKES MORE THAN
ONE TOOL

Open Source Data Science Toolkit
Figure: logo grid with the labels Key Languages, Key Tools, and Platform; named items include MLlib, PL/X, Pivotal Big Data Suite, and GemFire, with groups for modeling tools, visualization tools, and platform.

Scalable, In-Database
Machine Learning
•  Open Source https://github.com/madlib/madlib
•  Works on Greenplum DB, Apache HAWQ and PostgreSQL
•  In active development by Pivotal
•  MADlib is now an Apache Software Foundation incubator project!
Apache (incubating)

Functions
Supervised Learning
Regression Models
•  Cox Proportional Hazards Regression
•  Elastic Net Regularization
•  Generalized Linear Models
•  Linear Regression
•  Logistic Regression
•  Marginal Effects
•  Multinomial Regression
•  Ordinal Regression
•  Robust Variance, Clustered Variance
•  Support Vector Machines
Tree Methods
•  Decision Tree
•  Random Forest
Other Methods
•  Conditional Random Field
•  Naïve Bayes
Unsupervised Learning
•  Association Rules (Apriori)
•  Clustering (K-means)
•  Topic Modeling (LDA)
Statistics
Descriptive
•  Cardinality Estimators
•  Correlation
•  Summary
Inferential
•  Hypothesis Tests
Other Statistics
•  Probability Functions
Other Modules
•  Conjugate Gradient
•  Linear Solvers
•  PMML Export
•  Random Sampling
•  Term Frequency for Text
Time Series
•  ARIMA
(Function list as of Aug 2015)
Data Types and Transformations
•  Array Operations
•  Dimensionality Reduction (PCA)
•  Encoding Categorical Variables
•  Matrix Operations
•  Matrix Factorization (SVD, Low Rank)
•  Norms and Distance Functions
•  Sparse Vectors
Model Evaluation
•  Cross Validation
Predictive Analytics Library
@MADlib_analytic

Use Case: Predicting insurance claim amounts using structured and unstructured data
•  Using features from structured and unstructured data sources associated with claims, build the capability to predict claim amounts

Text analytics on MPP
•  Unstructured data in the form of claim comments and claim descriptions (text)
•  Use a bag-of-words approach (unigrams, bigrams)
•  tf-idf for more meaningful insights
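The bag-of-words and tf-idf steps can be sketched in a few lines. This is a minimal illustration using one common tf-idf variant (relative term frequency times log(N/df)); the whitespace tokenizer, the function names, and the sample claim texts are assumptions, not the notebook's implementation.

```python
import math
from collections import Counter
from typing import Dict, List


def ngrams(text: str) -> List[str]:
    """Lowercase, split on whitespace, and return unigrams plus bigrams."""
    tokens = text.lower().split()
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return tokens + bigrams


def tf_idf(docs: List[str]) -> List[Dict[str, float]]:
    """Weight each term by (count / terms in doc) * log(n_docs / docs containing term)."""
    term_counts = [Counter(ngrams(d)) for d in docs]
    # Iterating a Counter yields each distinct term once, giving document frequency.
    doc_freq = Counter(term for counts in term_counts for term in counts)
    n_docs = len(docs)
    weights = []
    for counts in term_counts:
        total = sum(counts.values())
        weights.append(
            {t: (c / total) * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return weights


# Hypothetical claim descriptions.
claims = ["water damage in kitchen", "fire damage in garage", "stolen bicycle"]
weights = tf_idf(claims)
print(sorted(weights[0].items(), key=lambda kv: -kv[1])[:3])
```

Terms that appear in many claims (such as "damage") receive lower weights than terms specific to one claim (such as "water"), which is why tf-idf tends to surface more informative features than raw counts.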

Code walkthrough: Text analytics on MPP
github.com/vatsan/text_analytics_on_mpp/tree/master/vector_space_models
We’ll walk through this Jupyter notebook

Use Case: Churn prediction
•  Build a churn model to predict which customers are most likely to churn
•  Provide insights into the key factors responsible for churn, so that intervention is possible before a customer churns

Usage Time Series Data
•  Aggregate weekly usage by user
•  Compute descriptive statistics
•  Extract features based on business expertise

Open Source Analytics Ecosystem
Companies benefit from algorithmic breadth and scalability for
building and socializing data science models
Figure: tool logos (MLlib, PL/X, and others) grouped under Algorithms and Visualization.
Best of breed in-memory and in-database tools for an MPP platform

Data Parallelism through PL/X: X in Python, R, Java, C/C++ and pgSQL
•  For embarrassingly parallel tasks, we can use procedural languages to easily parallelize any stand-alone library in Java, Python, R, pgSQL or C/C++
•  The interpreter/VM of the language ‘X’ is installed on each node of the MPP environment
•  plpython and python are loaded as dynamic libraries on the master and segment nodes (libpython.so and plpython.so are under $GPHOME/ext/python)
Figure: SQL enters at the master host (with a standby master) and reaches the segment hosts over the interconnect; each segment host runs multiple segments.

User Defined Functions (UDFs) in PL/Python
•  Procedural languages need to be installed on each database used.
•  The syntax is like a normal Python function with the function definition line replaced by a SQL wrapper. Alternatively, it is like a SQL user-defined function with Python inside.
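CREATE FUNCTION seasonality(x float[])
RETURNS float[]
AS $$
    import statsmodels.api as sm
    # Note: for a plain array input, seasonal_decompose also needs the
    # seasonal period passed explicitly (the freq or period argument,
    # depending on the statsmodels version).
    s = sm.tsa.seasonal_decompose(x).seasonal
    return s
$$ LANGUAGE plpythonu;

The CREATE FUNCTION ... AS $$ lines and the closing $$ LANGUAGE plpythonu; line are the SQL wrapper; the body between them is normal Python.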

Usage Time Series Data with PL/X
•  Easily harness your UDF with open source libraries (for machine learning, signal processing...)
•  Runs at scale through data parallelism

Code walkthrough: Image processing on MPP
github.com/gautamsm/data-science-on-mpp/tree/master/image_processing
In-database Canny edge detection with OpenCV inside a PL/C function

Pivotal Data Science Blogs
1.  Scaling native (C++) apps on Pivotal MPP
2.  Predicting commodity futures through Tweets
3.  A pipeline for distributed topic & sentiment analysis of tweets on Greenplum
4.  Using data science to predict TV viewer behavior
5.  Twitter NLP: Scaling part-of-speech tagging
6.  Distributed deep learning on MPP and Hadoop
7.  Multi-variate time series forecasting
8.  Pivotal for good – Crisis Textline
http://blog.pivotal.io/data-science-pivotal

Thank You!
