Best Practices for Big Data Analytics with Machine Learning

© 2013 Datameer, Inc. All rights reserved.
About our Speakers

Dr. Alex Guazzelli
Zementis Vice President, Analytics (@DrAlexGuazzelli)

Dr. Alex Guazzelli has co-authored the first book on PMML, the
Predictive Model Markup Language. At Zementis, Dr. Guazzelli is
responsible for developing core technology and analytical
solutions for Big Data and real-time scoring. Most recently, Dr.
Guazzelli started teaching a class on standards for predictive
analytics at UC San Diego Extension.
About our Speakers

Karen Hsu
Datameer Senior Director, Product Marketing (@Karenhsumar)

•  Over 15 years of enterprise software experience
•  Co-authored 4 patents
•  Worked in a variety of engineering, marketing and sales roles
•  Bachelor of Science degree in Management Science and Engineering from Stanford University
•  Came from Informatica
•  Worked with start-ups
•  Informatica purchased to bring data solutions to market
–  Data quality
–  Master data management
–  B2B
–  Data security solutions
Agenda
•  Considerations
•  Best Practices
•  Demonstration
•  Q&A
Considerations

Considerations
Target Users
▪  Business
▪  IT
▪  Data Scientist

Questions
▪  Descriptive, Predictive, Prescriptive
Target Users
Business Professional
▪  Visual
▪  Dependencies, Clustering, Decision Trees + More!
Target Users
IT
▪  Flexible, powerful
Target Users
Data Scientist
▪  Algorithms
▪  SAS, SPSS, R
Questions
Descriptive, Predictive, Prescriptive
▪  Descriptive machine learning…
–  Tells you what has happened
Questions
Descriptive, Predictive, Prescriptive
▪  Predictive machine learning…
–  Answers the question what will happen
Questions
Descriptive, Predictive, Prescriptive
▪  Prescriptive machine learning…
–  What will happen, when it will happen, why it will happen
–  Predict what will happen and prescribe how to take advantage of this future
Best Practices

Lean Analytics

Identify Use Case
1. Integrate
2. Prepare
3. Analyze
4. Visualize
Deploy
Data Preparation
▪  Union
▪  Cleanse
▪  Join
▪  Bin
▪  Normalize
▪  Profile
▪  Transform
▪  Outliers
▪  Missing values
▪  Invalid values
▪  Enrich
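
As a rough illustration of a few of the preparation steps named on this slide (invalid values, missing values, normalize, bin), here is a minimal sketch in plain Python. The function names, the valid age range of 0 to 120, the bin edges, and the sample ages are all assumptions made for illustration; none of them come from Datameer.

```python
def invalid_to_missing(values, low, high):
    """Treat values outside [low, high] as missing (None)."""
    return [v if v is not None and low <= v <= high else None for v in values]


def fill_missing(values):
    """Replace missing entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]


def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def bin_age(age):
    """Assign an age to a coarse, hypothetical age bin."""
    if age < 30:
        return "under_30"
    if age < 50:
        return "30_to_49"
    return "50_plus"


# Hypothetical raw column: one missing value and one invalid value (430).
raw_ages = [34, None, 54, 430]

valid = invalid_to_missing(raw_ages, 0, 120)
filled = fill_missing(valid)
normalized = min_max_normalize(filled)
bins = [bin_age(a) for a in filled]
```

The order matters in this sketch: invalid values are turned into missing values first so that they do not distort the mean used to fill the gaps.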
Descriptive Analytics

Drag & Drop Smart Analytics
Predictive Analytics
Predictive analytics is able to discover hidden patterns in historical data that the
human expert may not see. It is in fact the result of mathematics applied to data.
As such, it benefits from clever mathematical techniques as well as good data.

Predictive Analytics helps you discover patterns in the past, which can signal what is ahead.

Descriptive vs. Predictive Analytics
▪  Descriptive Analytics answers “What happened?”
▪  Predictive Analytics answers “What will happen next?”
Example: Predicting Churn
Matt - Churned 2 days ago

Scott - “Liked” our company last week

John - ??
Churn-related features
Matt
3 complaints in last 6 months
Opened 2 support tickets in last 4 weeks
Spent a total of $1,234 buying merchandise
Spent a total of $123 in services
Purchased 2 items in last 4 weeks
Is 34 years old
Is male
Lives in Los Angeles
...

Scott
No complaints in last 6 months
Opened 1 support ticket in last 4 weeks
Spent a total of $9,876 buying merchandise
Spent a total of $987 in services
Purchased 12 items in last 4 weeks
Is 54 years old
Is male
Lives in Chicago
...
Big Data
An ever-expanding ocean of data containing people and sensor data (lots and lots of it):
▪  Transaction records
▪  Social media
▪  Climate information
▪  Mobile GPS signals
▪  Healthcare
▪  Smart Grid
▪  Digital Breadcrumbs

Breadth and Depth

90% of the data today was created in the last 2 years
Churn-related “Big Data” features
Matt
12 friends listed as customers
2 complaints from friends in last 6 months
Average age of friends is 41 years old
2 friends churned in last 30 days
No purchases for same items as friends
1 website visit in last 7 days
2 website pages opened during last visit
Opened 3 newsletters in last 6 months
...

Scott
34 friends listed as customers
1 complaint from friends in last 6 months
Average age of friends is 62 years old
No friends churned in last 30 days
Purchased same 2 items as friends in last 2 months
3 website visits in last 7 days
5 website pages opened during last visit
Opened 12 newsletters in last 6 months
...
Building a predictive model ...
Model Training
Churn-related features (Churned, Not-churned) → Predictive Model

▪  Neural Networks
▪  Linear/Logistic Regression
▪  Support Vector Machines
▪  Scorecards
▪  Decision Trees
▪  Clustering
▪  Association Rules
▪  K-Nearest Neighbors
▪  Naive Bayes Classifiers
▪  ...

Data → Input Layer → Hidden Layer → Output Layer → Prediction
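
To make the training step concrete, here is a minimal sketch of one of the listed techniques, logistic regression, fitted with plain gradient descent. It uses two of the churn-related features from the earlier slide (complaints and items purchased) for Matt and Scott. The learning rate, the number of epochs, and John's feature values are assumptions made for illustration, and two training rows are far too few for a real model. This is not how Datameer or Zementis train models.

```python
import math


def sigmoid(z):
    """Map a real-valued score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features):
    """Return the modeled probability of churn for one customer."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def train(rows, labels, learning_rate=0.01, epochs=5000):
    """Fit logistic regression by batch gradient descent on the log loss."""
    n_features = len(rows[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in zip(rows, labels):
            error = predict(weights, bias, features) - label
            for i, x in enumerate(features):
                grad_w[i] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(rows)
                   for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias


# Features: (complaints in last 6 months, items purchased in last 4 weeks)
matt = (3, 2)    # churned, label 1
scott = (0, 12)  # did not churn, label 0
weights, bias = train([matt, scott], [1, 0])

matt_risk = predict(weights, bias, matt)
scott_risk = predict(weights, bias, scott)

# John's values are hypothetical; the slides do not give his features.
john = (2, 3)
john_risk = predict(weights, bias, john)
```

The fitted model separates the two training customers, so Matt's modeled risk ends up above 0.5 and Scott's below it. John's score depends entirely on the assumed feature values.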
Why not several models?
Model Ensemble
Raw Inputs → Data Pre-Processing → Model 1, Model 2, ..., Model n → Voting → Prediction
▪  Scores from all models are computed
▪  Voting: Majority Voting, Weighted Voting, Weighted Average, etc.
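
The three combination rules named on this slide can be sketched in a few lines of Python. The function names, the labels, the scores, and the weights below are assumptions chosen for illustration; a real ensemble would take them from its trained models.

```python
from collections import Counter


def majority_vote(labels):
    """Return the label predicted by the largest number of models."""
    return Counter(labels).most_common(1)[0][0]


def weighted_vote(labels, weights):
    """Return the label with the largest total weight behind it."""
    totals = {}
    for label, weight in zip(labels, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return max(totals, key=totals.get)


def weighted_average(scores, weights):
    """Combine numeric scores into one weighted mean score."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


# Hypothetical outputs of three models for one customer.
labels = ["churn", "no_churn", "churn"]
scores = [0.8, 0.2, 0.6]
weights = [1.0, 3.0, 1.0]  # the second model is trusted most

by_majority = majority_vote(labels)
by_weight = weighted_vote(labels, weights)
combined_score = weighted_average(scores, weights)
```

With these made-up numbers the two voting rules disagree: two of three models say "churn", but the heavily weighted second model tips the weighted vote to "no_churn".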
End Goal: Predicting churn ...

Model Deployment and Execution in Big Data
Churn-related Features → Predictive Churn Model → Churn Risk Score
From Model Building to Model Deployment
(Traditionally ...)

Scientist’s Desktop: SAS, R, IBM SPSS, Perl, Python

Lost in Translation

Production Environment: Java, .NET, C, SQL

SAS, R, IBM SPSS …
Great for model building but not for scoring, even more so when it comes to Hadoop
From Model Building to Model Deployment (with PMML)
Model Building
▪  Angoss
▪  BigML
▪  FICO Model Builder
▪  IBM SPSS
▪  KNIME
▪  KXEN
▪  MicroStrategy
▪  Open Data
▪  Pervasive DataRush
▪  RapidMiner
▪  R / Rattle
▪  SAS
▪  SAP Business Objects
▪  Salford Systems
▪  StatSoft STATISTICA
▪  SQL Server
▪  TIBCO Spotfire
▪  Custom Code, etc.

PMML (models)

Model Deployment and Execution
Datameer Server: Universal PMML Plug-in (UPPI)
Deploy in minutes ...
Predictive Model Markup Language
▪  PMML is an XML-based language used to define statistical and data mining models and to share these between compliant applications.
▪  It is a mature standard developed by the DMG (Data Mining Group) to avoid proprietary issues and incompatibilities and to deploy models.
▪  PMML eliminates the need for custom model deployment and ensures reliability.

Models, Data Transformations

PMML defines a standard not only to represent data-mining models, but also data handling and data transformations (pre- and post-processing)
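
To give a feel for what such a file looks like, here is a small hand-written fragment modeled on the PMML 4.1 `RegressionModel` element, together with a toy Python scorer for it. The field names and coefficients are invented, the fragment was not exported from any of the tools named in this deck and has not been validated against the DMG schema, and the `score` function handles only this one narrow case. It is a sketch of the idea, not how UPPI or Datameer execute PMML.

```python
import math
import xml.etree.ElementTree as ET

PMML_NS = {"pmml": "http://www.dmg.org/PMML-4_1"}

# Hypothetical model: field names and coefficients are made up.
PMML_DOC = """
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="Hypothetical churn model"/>
  <DataDictionary numberOfFields="3">
    <DataField name="complaints" optype="continuous" dataType="double"/>
    <DataField name="items_purchased" optype="continuous" dataType="double"/>
    <DataField name="churn_score" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel modelName="churn" functionName="regression"
                   normalizationMethod="logit">
    <MiningSchema>
      <MiningField name="complaints"/>
      <MiningField name="items_purchased"/>
      <MiningField name="churn_score" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="-0.5">
      <NumericPredictor name="complaints" coefficient="0.9"/>
      <NumericPredictor name="items_purchased" coefficient="-0.3"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
""".strip()


def score(pmml_text, record):
    """Score one record against a single-table RegressionModel."""
    root = ET.fromstring(pmml_text)
    model = root.find("pmml:RegressionModel", PMML_NS)
    table = model.find("pmml:RegressionTable", PMML_NS)
    z = float(table.get("intercept"))
    for predictor in table.findall("pmml:NumericPredictor", PMML_NS):
        z += float(predictor.get("coefficient")) * record[predictor.get("name")]
    # Apply the logistic link only when the model asks for it.
    if model.get("normalizationMethod") == "logit":
        return 1.0 / (1.0 + math.exp(-z))
    return z


matt_score = score(PMML_DOC, {"complaints": 3, "items_purchased": 2})
scott_score = score(PMML_DOC, {"complaints": 0, "items_purchased": 12})
```

Because the model lives entirely in the XML, the scoring code does not change when the coefficients do, which is the portability argument the slide makes.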
UPPI: Supported Techniques
▪  Neural Networks (neural gas, radial-basis and backpropagation)
▪  Support Vector Machines (for classification and regression)
▪  Naive Bayes Classifier (for continuous and categorical inputs)
▪  Rule Set Models
▪  Clustering Models (2-step clustering, distribution and center-based)
▪  Decision Trees (for classification and regression)
▪  General Regression Models (Cox, General and Generalized Linear Models)
▪  Regression Models (Linear, Logistic and Polynomial Regression Models)
▪  Scorecards (with support for Reason Codes)
▪  Restricted Boltzmann Machines
▪  Association Rules
▪  Multiple Models (with the possibility of having models spread over multiple PMML files)
▪  Model Ensemble (including Random Forest Models and Boosted Trees)
▪  Model Segmentation
▪  Model Chaining
▪  Model Composition
▪  Model Cascade

© Zementis, Inc. - Confidential
Demonstration Flow

▪  Descriptive: Karen
▪  Predictive Modeling: Alex
▪  Predictive Production: Karen
▪  Prescriptive: Karen
Descriptive Analytics

Descriptive Analytics
▪  Answers: What caused people to churn?
▪  Clustering
▪  Column Dependencies
▪  Decision Tree
Predictive Analytics

Predictive Analytics
▪  Who will churn?
Prescriptive Analytics

Prescriptive Analytics
▪  Who will churn? Why will they churn?
▪  What can we do to support that outcome?
Q&A
Next Steps:
More about Datameer and Big Data
www.datameer.com

More about Zementis
www.zementis.com

Contact us:
Alex Guazzelli aguazzeli@zementis.com

Karen Hsu khsu@datameer.com 
