Create a Data Science Lab with
Microsoft and Open Source Tools
Marcel Franke, pmOne AG, Germany
About me – Marcel Franke
Practice Lead Advanced Analytics & Data Science
pmOne AG – Germany, Austria, Switzerland
>10 years of experience with large-scale
Data Warehouses based on SQL Server
Blog: dwjunkie.wordpress.com
What is data science?
The Definition
Data science incorporates varying
elements and builds on techniques and
theories from many fields, including
mathematics, statistics, data engineering,
pattern recognition and learning, advanced
computing, visualization, uncertainty
modeling, data warehousing, and high
performance computing with the goal of
extracting meaning from data and
creating data products.

Source: http://en.wikipedia.org/wiki/Data_science
A brief look into history
GAMBLING –
THAT’S WHERE
EVERYTHING
STARTED
The beginnings of gambling
Gambling has existed since 3000 BC
First games were based on dice

Origins in China and Mesopotamia
* Source: Tiemeyer, E.; Zsifkovitis, H.: Information als Führungsmittel, München: Computerwoche Verlag 1995
Scientific foundations
17th century: Paradox of the
Chevalier de Méré
Pascal and Fermat discussed
the paradox in several letters
The beginning of the theory of
probability
* Source: http://de.wikipedia.org/wiki/De-M%C3%A9r%C3%A9-Paradoxon
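The paradox can be checked with a few lines of arithmetic. The sketch below assumes the usual formulation of the problem (at least one six in 4 throws of one die versus at least one double six in 24 throws of two dice); the function name is illustrative and not taken from the slides.

```python
from fractions import Fraction


def p_at_least_one(success: Fraction, throws: int) -> Fraction:
    """Probability of at least one success in `throws` independent throws."""
    return 1 - (1 - success) ** throws


# Bet 1: at least one six when throwing one die 4 times.
one_die = p_at_least_one(Fraction(1, 6), 4)
# Bet 2: at least one double six when throwing two dice 24 times.
two_dice = p_at_least_one(Fraction(1, 36), 24)

print(f"one die, 4 throws:   {float(one_die):.4f}")
print(f"two dice, 24 throws: {float(two_dice):.4f}")
```

The first bet wins slightly more than half of the time (about 0.518) and the second slightly less (about 0.491), although both offer the same ratio of throws to possible outcomes (4/6 = 24/36).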
The science in Data Science
Calculate probabilities
Pattern recognition
Analysis of variance
Machine Learning
Simulations
Predictions
BI, Data Mining & Prediction
WEATHER
FORECAST
What do companies do today?
Walmart – The pioneer of data analytics

Source: Data Unser – Dr. Bloching, Bilder: walmart.com, yourdealz.de, squidoo.com, fuzzybrew.com
Visa

80% correct prediction of divorces
within the next 5 years
Reason: Divorce is the highest risk
for private insolvency
Source: visa.de
Customers need to find the right case

What do consumers
really do?
Blonde looks
somehow different 

The new washing powder is really great…
Data can be accessed easily…
… but, it‘s hard to analyze it.
Other areas of application
SOCIAL
MEDIA

PRODUCT RECOMMENDATION
RETARGETING

PREDICTIVE
MAINTENANCE

PREDICT RISKS

areas of
application
SALES PREDICTIONS

CUSTOMER ANALYSIS

DYNAMIC PRICING

DISPOSITION
How does this fit with Big Data?
Our starting point…
Structured data

Unstructured data

Harmonize and
generate Information
(Role of „Data Scientist“)

„BIG Data“
Volume, Variety, Velocity
Typical Big Data Architecture
Big Data Analytics

Excel

Big Data Advanced Analytics

PowerPivot
Big Data Preparation (SQL, Map Reduce)

Unstructured data

Structured data
Massive Parallel Processing

Big Data Storage Platform
“[Facebook] started in the Hadoop world. We are now bringing in
relational to enhance that. We're kind of going [in] the other
direction.”
“We've been there, and [we] realized that using the wrong
technology for certain kinds of problems can be difficult. We
started at the end and we're working our way backwards, bringing
in both.”
Ken Rudin, Director of Analytics for Facebook
Source: http://tdwi.org/articles/2013/05/06/facebooks-relationalplatform.aspx?j=192038&e=marcel.franke@pmone.com&l=50_HTML&u=3967541&mid=1060748&jb=84&m=1
A few words about „R“
• R is a language and environment for statistical
computing and graphics
• R is Open Source under the GNU General Public License
• One of the most widely used statistical software environments
• Everything happens in-memory
• Comes with a package manager (~5000 packages)
• Also provides graphics capabilities
Samples of R
How to approach projects?
Starting Point
Problems, which we know from the BI world already, are further exacerbated by
big data.

• Complexity of systems constantly grows
• Amount of data grows exponentially (= Big Data)
• The need for change is more frequent and reaches ever deeper into
business rules
• Solutions can no longer be planned in advance
Solution Option 1 – Classic Deterministic

Everything can be planned and
designed at the drawing board…
How does a system of products & components and their
relationships behave?

Source: Cesar Hidalgo
Solution Option 2 – Learn from „Mother Nature“
• How does nature deal with complex non-linear systems?
• Evolution – Variation and selection – „Trial and Error“

„It is not the strongest of the species that
survives, nor the most intelligent but the one
most responsive to change.“ (Charles Darwin)
A candlestick?
45 Iterations

Technology helps to speed up iterations.
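The "variation and selection" loop described above can be sketched in a few lines. This is a minimal illustration, not the method used in the talk: the quality function, the single design parameter and all default values are hypothetical, and the default of 45 generations merely echoes the iteration count on the slide.

```python
import random


def evolve(fitness, start, generations=45, offspring=10, step=0.5, seed=0):
    """Trial and error: mutate the current best design, keep the fittest."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    best = start
    history = [fitness(best)]
    for _ in range(generations):
        # Variation: create slightly mutated copies of the current best.
        candidates = [best + rng.gauss(0, step) for _ in range(offspring)]
        # Selection: the parent survives unless an offspring is fitter.
        best = max(candidates + [best], key=fitness)
        history.append(fitness(best))
    return best, history


def quality(x):
    """Hypothetical one-parameter design problem with its optimum at x = 3."""
    return -(x - 3.0) ** 2


best, history = evolve(quality, start=0.0)
print(f"design parameter after 45 iterations: {best:.3f}")
print(f"quality improved from {history[0]:.3f} to {history[-1]:.3f}")
```

No model of the system is needed: each iteration only has to compare candidates, which is why faster technology translates directly into more iterations per unit of time.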
Laboratory & Factory
The laboratory

Trial & Error
Pattern Recognition
Analytical Apps
An efficient laboratory to experiment
Power Pivot
In-Memory

Microsoft Excel

Power View

Unstructured
Data

Power Query

Source Systems

Power Map

SQL Server

Structured
Data
OLE DB
OData

WebServer-Logs
Sensor-Data

Data Marketplace

SAP

Databases
Easy to consume

The factory
Integrated in the business process

Analyze mass data

Host it and run it

At Enterprise Scale
For Realtime Enterprise
Stable Big Data Architecture
Prediction &
Data Science

Front-Ends &
Mobile
Windows
Azure

On-Premises

Source Systems

Unstructured
Data

WebServer-Logs
Sensor-Data

HDInsight

SQL Server PDW

Data Marketplace

Structured
Data

SAP

Databases
How do we scale?
The battle
How do we scale?
Relational data & compute

SQL Server 2012
Parallel Data
Warehouse
Half Rack

Infiniband

Analytical data &
compute

HP DL 385
40 Cores
2 TB RAM
Fusion-IO Card
What is Revolution Analytics?
• Founded in 2007
• Aim: evolve R for high-performance use
• Offers R packages for faster performance and
greater stability
• Enterprise & Community products
• Stand-alone, Scale-out (HPC), on Hadoop
How do we handle our data?
R-ODBC: 10 MB/s

Flat file export: 80 MB/s

Data preparation

Data transfer

predictive scripts
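To see what the two transfer rates mean in practice, the sketch below converts them into wall-clock time. Only the rates of 10 MB/s and 80 MB/s come from the slide; the data volume of 100 GB is a hypothetical figure chosen for illustration.

```python
def transfer_minutes(size_gb: float, rate_mb_per_s: float) -> float:
    """Minutes needed to move `size_gb` gigabytes at a sustained rate."""
    return size_gb * 1024 / rate_mb_per_s / 60


SIZE_GB = 100  # hypothetical data volume, not a figure from the talk

odbc = transfer_minutes(SIZE_GB, 10)       # R-ODBC rate from the slide
flat_file = transfer_minutes(SIZE_GB, 80)  # flat file export rate from the slide

print(f"R-ODBC:    {odbc:.1f} min")
print(f"Flat file: {flat_file:.1f} min")
print(f"Speed-up:  {odbc / flat_file:.0f}x")
```

Whatever the actual volume is, the flat file route is eight times faster, so data transfer rather than the predictive script can easily dominate the total runtime.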
Results
• Generate predictions for 30,000 customers
– 50,000 rows per customer, 54 columns
– Customer goal: 5 minutes
– Our solution: 7,500 customers in 5 minutes
– Benchmark: 1 minute
• Revolution Analytics ODBC driver does not work with PDW
• Standard R ODBC driver reads data at 10 MB/s
• Workaround via flat file export
• RDS format is faster than CSV
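The figures above can be put into relation with a short calculation. It assumes that the customer goal refers to scoring all 30,000 customers within 5 minutes, which the slide does not state explicitly; the variable names are illustrative.

```python
CUSTOMERS = 30_000
ROWS_PER_CUSTOMER = 50_000
COLUMNS = 54
GOAL_MINUTES = 5            # assumed to apply to all customers
ACHIEVED_CUSTOMERS = 7_500  # customers scored within GOAL_MINUTES

total_rows = CUSTOMERS * ROWS_PER_CUSTOMER
total_values = total_rows * COLUMNS

# Time for a full run at the measured rate, and the gap to the goal.
full_run_minutes = CUSTOMERS / ACHIEVED_CUSTOMERS * GOAL_MINUTES
missing_speedup = full_run_minutes / GOAL_MINUTES

print(f"rows to score:       {total_rows:,}")
print(f"values to read:      {total_values:,}")
print(f"full run at current rate: {full_run_minutes:.0f} min")
print(f"speed-up still needed:    {missing_speedup:.0f}x")
```

Under that assumption the workload is 1.5 billion rows, a full run takes 20 minutes, and the solution would need to become four times faster to meet the goal.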
Other solutions?
• R in database
• R on Hadoop
– RHadoop
– Revolution Analytics RHadoop
Other solutions?
• Services & Cloud
THANK YOU!
• For attending this session and
PASS SQLRally Nordic 2013, Stockholm


Editor's Notes

  • #6 Many topics and skills are combined. The data warehouse is also a part of it. More statistics and mathematics skills are needed.
  • #7 Where does Data Science come from?
  • #8 When you do some research on that topic, you will automatically stumble upon gambling, or games of chance.
  • #9 Dice cup
  • #10 2 scientists started thinking about gambling in a more scientific way. Writing very long letters back and forth. Different probability to win if you play with 1 die or 2.
  • #11 1.) How big is the probability to win or lose, or to reach a certain goal? 2.) Is there any correlation between the customer income and the sales amount? 5.) What happens if we change certain parameters like price? 6.) What is the sales amount of a certain product in the next quarter or year?
  • #12 How does this topic fit with BI?
  • #13 What can I do with it?
  • #14 So what do companies do with it? I consciously didn‘t use the word Big Data, but you all know that this new area is very hot in marketing and news. So what are the good examples & use cases?
  • #15 Kasse – cash desk; Belohnung – reward; Windel – nappy
  • #23 Emphasize the importance of R -> almost all vendors build on R. Widely used in the open source space.
  • #32 Injector for washing pellets. Waste, poor quality,
  • #36 Idea of a process model called Lab & Factory. Experimental approach. Iterative. Fast. Find new patterns.
  • #37 It is for the data scientist to experiment.
  • #40 If we found something interesting, we can deploy it to the factory. It‘s the place where we run our analytical code at enterprise scale.
  • #43 Most of the analytical tools have been out there for years, like databases, R, SAS, SPSS. We often hear about limitations in scalability & performance. DB -> MPP. R, SAS -> In-Memory.
  • #44 POC on different analytic use cases with the big vendors: complex SQL queries, simulations, predictions with R.
  • #45 SQL -> we know how to scale. R -> scaling is difficult, hence Revolution.
  • #49 No stable market, many options.