DATA SCIENCE
DATA ENGINEERS
DATA SOLUTIONS
Think Big, Start Smart, Scale Fast
Eliano Marques, Senior Data Scientist
Martin Oberhuber, Senior Data Scientist
CONFIDENTIAL | © 2015 Think Big, a Teradata Company
Think Big History
1st SI Solution Provider with 100% focus on open source and Big Data Hadoop ecosystem
• 100+ Successful Programs
• 70+ Clients
• Global Delivery Capabilities
• We are hiring
Think Big Clients
Trusted Analytics Services Provider to the Fortune 1000
eCommerce: 2 of Global Top 5
Internet Transaction Security: Global #1
Retail: 2 of Global Top 5
Brokerage & Mutual Funds: 2 of Global Top 5
Social Networking: Global #1
Asset Management: Global #1
Credit Issuer: 2 of Global Top 5
Semiconductor: 2 of Global Top 5
Banking: 4 of Global Top 10
Data Storage Devices: 3 of Global Top 5
Financial Data Services: 2 of Global Top 5
Disk Manufacturing: Global #1
Financial Exchanges: Global #2
Telecommunications: 2 of Global Top 5
Media & Advertising: 2 of Global Top 5
Think Big VELOCITY Methodology
• Big Data Strategy
• Think Big Academy
• Big Data Program Mgt
• Business Analytics
• Managed Services
• Data Engineering
• Big Data Lab
Think Big engages with its clients' business, technical, analyst and support teams in an agile-inspired VELOCITY Methodology to continuously develop Big Data solutions.
What is Apache Spark?
• Open source Apache project
− Parallel middleware for server clusters
− spark.apache.org (2014)
• Developed by UC Berkeley’s AMPLab
− Supported by Databricks
• Top use cases
− SQL-on-Hadoop
− Machine learning
− Streaming data mini-batches
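To make the "streaming data mini-batches" use case concrete, here is a minimal plain-Python sketch of the idea: a stream of timestamped events is cut into fixed-length windows, and each window is then processed as a small batch. This is not Spark's API. The function name `mini_batches`, the event tuples and the one-second interval are assumptions chosen for illustration.

```python
from collections import Counter
from itertools import groupby


def mini_batches(events, interval):
    """Group (timestamp, value) events into consecutive time windows.

    Assumes `events` is already sorted by timestamp. Yields one
    (window_start, values) pair per non-empty window.
    """
    for window, group in groupby(events, key=lambda e: int(e[0] // interval)):
        yield window * interval, [value for _, value in group]


# Hypothetical event stream: (seconds since start, event type).
events = [(0.2, "click"), (0.9, "view"), (1.1, "click"), (2.5, "view"), (2.7, "view")]

# Each mini-batch is processed like an ordinary small data set,
# here by counting event types per window.
counts = [(start, Counter(batch)) for start, batch in mini_batches(events, interval=1.0)]

for start, counter in counts:
    print(start, dict(counter))
```

The design point is that batch-style code (the `Counter` call) is reused unchanged on each window, which is what makes the mini-batch model attractive for teams that already have batch analytics.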
What is Apache Spark?
Apache Spark Core Engine
• Spark SQL
• Spark Streaming
• MLlib (Machine learning)
• GraphX (Graph)
Scala, R (SparkR), Python (PySpark)
Data Science Approaches
Single Workstation
- Small data sets
- No distributed analytics across multiple nodes
- Powerful tools are R or Python
- Data scientist can focus on business problem
Mixed Single Workstation / Cluster
- Small or large data sets
- Data wrangling and feature engineering are performed on cluster
- Predictive analysis and modeling can be performed on single workstation
- Powerful tools are Hadoop Streaming and Spark combined with R and Python
- Data scientists now have to worry about parallelisation of some data mining tasks (usually the ones that are embarrassingly parallel)
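An "embarrassingly parallel" task is one whose pieces need no communication with each other, for example fitting one small model per customer segment. The sketch below illustrates that pattern with the Python standard library only. The segment names, the data and the helper names `fit_line` and `fit_all_segments` are hypothetical, and a thread pool stands in for the cluster: for CPU-bound work one would swap in a process pool or a cluster-side map, which is an assumption about deployment and not something the slides specify.

```python
from concurrent.futures import ThreadPoolExecutor


def fit_line(points):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def fit_all_segments(segments, max_workers=4):
    """Fit one independent model per segment; no task depends on another."""
    names = list(segments)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        fits = list(pool.map(fit_line, (segments[name] for name in names)))
    return dict(zip(names, fits))


# Hypothetical per-segment data sets of (x, y) pairs.
segments = {
    "retail": [(0, 1.0), (1, 3.0), (2, 5.0)],
    "banking": [(0, 2.0), (1, 1.5), (2, 1.0)],
}

models = fit_all_segments(segments)
for name, (intercept, slope) in models.items():
    print(name, intercept, slope)
```

Because each call to `fit_line` touches only its own segment, the same function can be handed to any parallel map without changes, which is why these tasks are usually the first ones moved to the cluster.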
Cluster
- Large data sets
- Both data wrangling and modeling are performed on cluster
- Spark is one of the few tools that support efficient parallel machine learning
- Parallelising machine learning algorithms is challenging
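One reason parallelising machine learning is harder than parallelising data wrangling is that many training algorithms are iterative: every step needs a result combined from all partitions before the next step can start. The following plain-Python sketch shows this for gradient descent on a linear model. It is an illustration under stated assumptions, not how any particular library implements it; the function names, the toy data and the learning rate are all hypothetical.

```python
def partial_gradient(partition, w, b):
    """Squared-error gradient contribution of one partition of (x, y) rows."""
    grad_w = grad_b = 0.0
    for x, y in partition:
        error = (w * x + b) - y
        grad_w += error * x
        grad_b += error
    return grad_w, grad_b, len(partition)


def gradient_step(partitions, w, b, learning_rate):
    # "Map": each partition computes its contribution independently.
    parts = [partial_gradient(p, w, b) for p in partitions]
    # "Reduce": contributions must be summed before the model can be
    # updated, so every iteration ends in a synchronisation point.
    grad_w = sum(p[0] for p in parts)
    grad_b = sum(p[1] for p in parts)
    n = sum(p[2] for p in parts)
    return w - learning_rate * grad_w / n, b - learning_rate * grad_b / n


# Toy data following y = 1 + 2x, split across two hypothetical partitions.
partitions = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]

w, b = 0.0, 0.0
for _ in range(2000):
    w, b = gradient_step(partitions, w, b, learning_rate=0.1)

print(w, b)
```

The map step scales out easily, but the reduce step is repeated on every iteration, so communication cost grows with the number of iterations. Algorithms that lack such a sum-over-partitions structure are harder still to distribute.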
Productionising Analytics
• Integration of R and Python with Hadoop and Spark
• Leveraging computing power of Hadoop cluster for distributed analytics
• Plug & play model deployment tools for easy and robust productionising of analytics models
[Diagram labels: Data Sources (Operations, Sales, Marketing, etc.); Ingestion; Data; Real-time Data; Data Lake (HDFS); Core Data Science; Plug & play model deployment; Production (Dashboards, R Shiny Apps, Predictive model scoring); Real-time Optimization with Multi-armed Bandit]
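The slide names real-time optimization with a multi-armed bandit but does not say which bandit algorithm is used. As a hedged illustration, the sketch below uses epsilon-greedy, one of the simplest strategies: most of the time it serves the option with the best observed reward rate, and occasionally it explores a random option. The conversion rates, the number of rounds and the function name `epsilon_greedy` are assumptions made up for this example.

```python
import random


def epsilon_greedy(true_rates, rounds, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit over arms with Bernoulli rewards.

    `true_rates` are the hidden success probabilities of each arm; the
    algorithm only ever sees the sampled rewards.
    """
    rng = random.Random(seed)
    k = len(true_rates)
    pulls = [0] * k          # how often each arm was served
    estimates = [0.0] * k    # running mean reward per arm
    for _ in range(rounds):
        if rng.random() < epsilon:
            arm = rng.randrange(k)                            # explore
        else:
            arm = max(range(k), key=lambda i: estimates[i])   # exploit
        reward = 1.0 if rng.random() < true_rates[arm] else 0.0
        pulls[arm] += 1
        # Incremental mean update, so no reward history is stored.
        estimates[arm] += (reward - estimates[arm]) / pulls[arm]
    return pulls, estimates


# Three hypothetical variants (for example page layouts) with unknown rates.
pulls, estimates = epsilon_greedy([0.1, 0.2, 0.8], rounds=5000)
print(pulls)
print(estimates)
```

Unlike a fixed-split A/B test, the bandit shifts traffic toward the better variant while it is still learning, which is what makes it suitable for the real-time data path shown in the diagram.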
Data Science and Analytics Overview
Data Science Project
• Project Kick-off
• Data Profiling and Exploratory Analysis
• Analytics Modeling
• Model Validation
• Model Publishing
• Reporting
We leverage our expertise across industries
Dynamic Pricing
Fraud Detection
Customer Segmentation
Recommendation Engine
Predictive Asset Maintenance
Proactive Customer Support
Credit Default Prediction
Churn Modeling
Scenario Simulation
A/B Testing
Display Targeting Optimisation
Demand Forecast
Cluster Analysis & Segmentation
Device Analytics
Risk Analytics
Customer Analytics
Thank you

Today’s reality, Hadoop with Spark: How to select the best Data Science approach when using Big Data platforms and technologies?
