AUGUST 2016
Big Data Analytics for BI/BA/QA
Dmitry Tolpeko
2
BIG DATA
Why was it invented?
How is it used now?
How will it be used in the near future?
What do we need to do to stay competitive?
3
FIRST QUESTIONS
At what size does it start?
Is it just another technology vendor?
4
IN REALITY
It is very easy to start using Hadoop
and Cloud now.
So it is true that most people are now doing
traditional things with just larger data
sets.
And at much lower cost, of course.
So it looks like size matters, and this
is just another technology.
5
BUT IT IS …
Completely new mindset and
approach to analytics
Solution to satisfy new, “mass
market” analytics
And you cannot skip it
6
YOU CAN FEEL THIS AS …
Developers (Java, .NET etc.),
non-BI and even non-IT people talk about
and work with analytics today.
That was not the case before.
So what happens?
7
TRADITIONAL ANALYTICS
Expensive
Separate and isolated BI world
Analyzing transactions (data you
cannot afford to lose or calculate
with errors)
Historical data and strategic decisions
8
AND TODAY THIS IS …
Very small % of analytics (1-5%?)
Analytics Boom
9
EVERYTHING IS ABOUT DATA
Mindset: Data Analysis
not OLTP, DWH, ETL
Kimball/Inmon
Any application: UX+Analytics
(e.g. Machine Learning)
Competing on analytics, not just
product and service
Analytics becomes operational,
mass market
10
THE NEXT BIG SHIFT?
Digital Transformation of Economy
IoT, VR, AR, Machine Learning, AI
Personalized UX
Heavily relies on analytics
11
ANALYTICS TODAY
Fast, Advanced and Predictive
Analytics
o Personalization and customization: from
summary reports to a lot of tailored
data-driven actions (in near real time)
o Fast prototyping, implementation,
deployment and fast performance
o Data lakes
12
EXAMPLE - YESTERDAY
A company sends a promo email to
1M users, paying $1 for each email;
50,000 users purchase goods at
$25
Profit: 50,000 * $25 - $1M =
$250,000
This is what traditional analytics
does.
13
EXAMPLE - TODAY
The company identifies just 100,000 users
to send the promo email to; now
30,000 users purchase goods at $25
Profit: 30,000 * $25 - $100K =
$650,000
No new customers, no new
contracts – just algorithms and more
data
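The arithmetic on the two example slides can be checked with a short sketch. The function name and parameter names below are hypothetical, and the sketch assumes, as the slides do, that each buyer contributes a flat $25 and each email costs $1.

```python
def campaign_profit(emails_sent, cost_per_email, buyers, revenue_per_buyer):
    """Profit = revenue from buyers minus the cost of sending the emails."""
    return buyers * revenue_per_buyer - emails_sent * cost_per_email


# "Yesterday": blast 1M users, 50,000 of them buy.
yesterday = campaign_profit(1_000_000, 1, 50_000, 25)

# "Today": target 100,000 selected users, 30,000 of them buy.
today = campaign_profit(100_000, 1, 30_000, 25)

# Share of recipients who bought in each campaign.
yesterday_conversion = 50_000 / 1_000_000
today_conversion = 30_000 / 100_000

print(yesterday, today)
print(yesterday_conversion, today_conversion)
```

Fewer buyers in absolute terms still produce more profit, because the targeted list converts at 30% instead of 5% and the mailing cost drops tenfold.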
14
USE CASES
o Anomaly Detection
o Recommendation Systems
o Loyalty and Retention Programs
o Optimization
o A/B Testing
o Alarms, Scoring, Diagnosis
o Demand Forecasting and so on.
15
NEW CORE SKILLS
Distributed Data Processing and
Streaming Analytics
Programming (Python, R, Spark)
Math, Statistics
Machine Learning
Deep Learning
16
MACHINE LEARNING
Automation of discovery
Automatically adapt to new
circumstances
Detect patterns
In wide use now. “Self-testing”.
Few lines of code
17
BUILDING BLOCKS
Enriching analysis, development and
quality in software development
o Generic algorithms vs hardcoding
endless IF-ELSE
o Discovering hidden, not obvious
patterns
o Finding anomalies, outliers vs test
cases
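As a rough illustration of "generic algorithms vs hardcoding endless IF-ELSE", the sketch below flags outliers with a plain z-score rule instead of hand-written range checks. The sample values, the function name and the threshold of 2.0 are assumptions made for the example, not anything prescribed by the slides; a z-score is only one of many possible anomaly detectors.

```python
import statistics


def find_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        # All values are identical, so nothing can be an outlier.
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical response times in ms; one measurement is clearly off.
response_times = [10, 11, 9, 10, 12, 10, 11, 95]
outliers = find_outliers(response_times)
print(outliers)
```

The same function works unchanged on any numeric series, which is the point of the slide: the rule is learned from the data rather than encoded as one test case per expected value.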
18
BI TOOLS NOW
Self-service (fewer jobs?)
Advanced analytics (requires
understanding stats and machine
learning fundamentals)
19
SOURCE DATA
Non-transactional systems, weak or
no data model
Calculations with probability
Raw, unstructured data from
diverse data sources
Extracting small relevant pieces of
data from huge data sets
20
PEOPLE
Data engineers
Data scientists
Significant workforce, not just
1-5% as in BI
21
GOOD NEWS
BI people are still a good match, as they
love crunching data
But significant shift in skills is
required
22
WHY GET INVOLVED
o Cutting edge
o Challenges
o Cool stuff (predictions, AI
etc.)
o Growth, margin and revenue
23
HOW TO GET INVOLVED
o Mindset
o Skills
o Experience
o Solutions
24
PLATFORMS
25
TRADITIONAL EDW PLATFORMS
o Too expensive ($10,000 per TB and more)
o Large upfront cost
o Not easy procurement, setup and
maintenance
o Designed for relational data, SQL interface
only, limited schema flexibility
o Data must be loaded first (modeled,
prepared and moved)
o Marketing limitations for Appliances
26
TRADITIONAL OPEN SOURCE PLATFORMS
• Designed for relational data, SQL interface
only, limited schema flexibility
• Data must be loaded first (modeled,
prepared and moved)
• Not easily scalable (scale up and down)
27
TRADITIONAL DATA MINING TOOLS
• Expensive
• Smaller community (one more isolated
world)
• Targeted for enterprise users
• Longer release cycles, no way to mix tools
and try fresh new stuff etc.
• Scalability and integration issues
28
WHY BIG DATA AND CLOUD
o Extremely economically attractive
o Scalable and elastic
o Self service
o Rich and diverse data tools
o Good enough quality (and
constantly improving)
29
BIG DATA AND CLOUD DESIGN PRINCIPLES
Decoupling Data Storage and Computing
o Database engine does not own data anymore
o Simplified load/extract
o Schema on read
o Not just SQL interface
o Any computing engines on top of data
Commodity Hardware
o Fault tolerant
Scale up and down
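"Schema on read" can be sketched in a few lines: raw records are stored as they arrive, and each consumer applies its own schema only when reading. The records, field names and the `read_with_schema` helper below are hypothetical and use only the Python standard library; real engines on top of Hadoop or cloud storage do this at far larger scale.

```python
import json

# Raw events stored as-is: no upfront model, fields vary per record.
RAW_EVENTS = [
    '{"user": "a", "amount": "25", "channel": "email"}',
    '{"user": "b", "amount": 10.5}',
    '{"user": "c", "clicks": 3}',
]


def read_with_schema(raw_lines, schema):
    """Parse raw JSON lines and project them onto `schema` at read time.

    `schema` maps field name -> (converter, default for missing fields).
    """
    for line in raw_lines:
        record = json.loads(line)
        yield {
            field: convert(record[field]) if field in record else default
            for field, (convert, default) in schema.items()
        }


# Two consumers read the same raw data through different schemas.
purchases = list(read_with_schema(RAW_EVENTS, {"user": (str, None), "amount": (float, 0.0)}))
clicks = list(read_with_schema(RAW_EVENTS, {"user": (str, None), "clicks": (int, 0)}))

print(sum(p["amount"] for p in purchases))
print(clicks)
```

Nothing had to be modeled or loaded before the first query, and adding a third view of the data means writing a new schema, not reloading the store.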
30
GROWTH PATH
From monolithic suites to diverse and rich tool set
SQL tools on Hadoop, Cloud
Advanced Data Analysis and Analytics
o Spark, MapReduce, NoSQL
o Python, R, Java, Scala
o Statistics
o Batch, Streaming, Real-time
Machine Learning and Deep Learning
o Understand use cases
o Understand specific algorithms and their
application
o Implementation
31
GAME (HOME WORK)
32
LET’S WIN THIS CAR
Suppose you're on a game show, and
you're given the choice of three
doors:
Behind one door is a car; behind the
others, goats.
You pick a door, say No. 3
33
SWITCH OR NOT?
Then the host, who knows what's
behind the doors, opens another
door, say No. 2, which has a goat.
He then says to you, "Do you want
to pick door No. 1?"
Is it to your advantage to switch
your choice?
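One way to approach the homework empirically is to simulate the game many times and compare the two strategies. The sketch below is a minimal Monte Carlo simulation; the function names, the number of trials and the seed are arbitrary choices, and it assumes the host always opens a goat door that the player did not pick.

```python
import random


def play(switch, rng):
    """Play one round; return True if the player ends up with the car."""
    doors = [0, 1, 2]
    car = rng.choice(doors)
    pick = rng.choice(doors)
    # The host opens a door that is neither the player's pick nor the car.
    opened = rng.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Move to the only door that is neither picked nor opened.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car


def win_rate(switch, trials=100_000, seed=0):
    """Estimate the probability of winning for a given strategy."""
    rng = random.Random(seed)
    wins = sum(play(switch, rng) for _ in range(trials))
    return wins / trials


stay_rate = win_rate(switch=False)
switch_rate = win_rate(switch=True)
print(stay_rate, switch_rate)
```

With enough trials the estimates should settle near 1/3 for staying and 2/3 for switching, which is the kind of "calculation with probability" the earlier slides describe.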
