Getting It Right Exactly Once:
Principles for Streaming Architectures
Darryl Smith, Chief Data Platform Architect and Distinguished Engineer, Dell Technologies
September 2016 | Strata+Hadoop World, NY
2
Getting Started
 I’m Darryl Smith
• Chief Data Platform Architect
and Distinguished Engineer
Dell Technologies
 Agenda
• Real-Time And The Need For Streaming
• Adding Real-Time And Streaming To The Data Lake
• Results, Plans, Lessons Learned
• Demonstration
3
Trickle, Flood, or Torrent…
Streaming is about
continuous data motion,
more than speed
or volume
4
The Conversation Around Streaming
Website and Mobile
Application Logs
Internet of Things
Sensors
The Enterprise Reality
5
Batch > Real-Time > Streaming
Enterprise Opportunities
Immediate Business Advantage
Website and Mobile
Application Logs
Internet of Things
Sensors
6
The Enterprise Streaming Play
Moving from batch to real-time streams
avoids surges, normalizes compute,
and drives value
7
Real time and the need for streaming
8
Drive DellEMC towards a
Predictive Enterprise via
intelligent data, driving agility
and increasing revenue and
productivity, resulting in a
competitive advantage
Analytics Vision
9
 Need to use new data for
competitive advantage
• Volume, Variety and Velocity
 Leverage near real time and
streaming data sets to
optimize predictions
• Make faster, better decisions
 Cost-effectively scale to
improve query and load
performance
 Put the data in the hands of
the business
Becoming An Analytical Enterprise
DRIVE COMPETITIVE ADVANTAGE
COST-EFFECTIVELY SCALE
DATA ACCESS BY BUSINESS
NEAR REAL-TIME ANALYTICS
10
Problem Statement
Teams do not get maintenance renewal
quotes within the timeframes, or at
the level of quality, that they need
for Tech Refresh and Renewal sales.
Desired Outcome
Implement a cost-effective,
real-time solution that
improves productivity
and gives confidence to
produce desired outcomes
efficiently.
Scoping The Business Objectives
11
Business Drivers
CURRENT REALITY -> VISION FOR THE FUTURE
• HIGH TOUCH TACTICAL EXECUTION -> LOW TOUCH SELF SERVICE
• DATE DRIVEN PROCESSES -> BUSINESS VALUE DRIVEN PROCESSES
• INEFFICIENCIES & LOST PRODUCTIVITY -> INCREASED PRODUCTIVITY
• SILOED DATA / LIMITED VIEWS -> SINGLE VIEW OF DATA / DATA SCORING
• VARIABLE DATA QUALITY -> DATA QUALITY & CONFIDENCE
TO REALIZE THIS VISION: IMPLEMENT CALM SOLUTION PHASES AND OPTIMIZE BUSINESS PROCESSES
12
The Need for “CALM”
Customer Asset Lifecycle Management
For enterprise sales
Who need accurate and timely customer information
CALM is a real-time application
Providing up-to-the-moment customer 360° dashboards
Install Base
Pricing
Device Config
Contacts
Contracts
Analytics Contracts
Component
Data
Offers
Scorecard
13
Data Lake Architecture
DATA PLATFORM
VMWARE VCLOUD SUITE
EXECUTION: GemFire, Cassandra, PostgreSQL, MemSQL
PROCESS: Greenplum DB, Spring XD, Pivotal HD, Hadoop
INGESTION: Real-Time, Micro-Batch, Batch
DATA GOVERNANCE: Apache Ranger, Attivio, Collibra
HDFS ON ISILON
HADOOP ON SCALEIO
VCE VBLOCK/VxRACK | XTREMIO | DATA DOMAIN
ANALYTICS TOOLBOX
APPLICATIONS
STRUCTURED / UNSTRUCTURED sources: CRM, PLM, ERP, Network, Web, Sensor, Supplier, Social Media, Market
14
Data Ingestion
• Small to Big Data (high-throughput)
• Structured and unstructured Data from any Source
• Streams and Batches
• Secure, multi-tenant, configurable Framework
Real-Time Analytics
• Tap into streams for in-memory Analytics
• Real Time Data insights and decisions
Services
• Data Ingestion to Data Lake
• Data Lake APIs
• Data Alerting
Business Data Lake Offerings
Unstructured
Structured
15
Adding Real Time and Streaming
to the Data Lake
16
Seeking A Fast Database
A complement to the business data lake
O P C M
HammerDB Platform Benchmarks
HammerDB workload testing followed the standard practices of EMC’s Oracle and
SQL Server DBA teams.
 Workload definition: a mix of 5 transactions, as follows:
• New order: receive a new order from a customer: 45%
• Payment: update the customer balance to record a payment: 43%
• Delivery: deliver orders asynchronously: 4%
• Order status: retrieve the status of customer’s most recent order: 4%
• Stock level: return the status of the warehouse’s inventory: 4%
 Testing scenario:
• Database creation and initial data load: 100 warehouses, 8 vUsers.
• Timed testing: 20 minutes per testing session.
• Number of virtual users scaled from 1 to 44 across testing sessions.
 No changes were made to the system or database configuration while the
tests were running.
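The transaction mix above can be pictured as weighted sampling. The following is a minimal sketch, not HammerDB code: the names WORKLOAD_MIX and sample_transactions are hypothetical, and only the percentages come from the slide.

```python
import random
from collections import Counter

# Transaction mix from the slide (percent of all transactions).
WORKLOAD_MIX = {
    "new_order": 45,
    "payment": 43,
    "delivery": 4,
    "order_status": 4,
    "stock_level": 4,
}


def sample_transactions(n, seed=0):
    """Draw n transaction types according to WORKLOAD_MIX (hypothetical helper)."""
    rng = random.Random(seed)  # fixed seed keeps the sketch repeatable
    names = list(WORKLOAD_MIX)
    weights = list(WORKLOAD_MIX.values())
    return Counter(rng.choices(names, weights=weights, k=n))


counts = sample_transactions(100_000)
for name, count in counts.most_common():
    print(f"{name:12s} {count / 1000:.1f}%")
```

The five weights add up to 100, so the sampled shares should land close to the percentages listed on the slide.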
HammerDB Workload Testing
 Each test ran on 16 vCPU x 32 GB RAM
• Oracle 11g R2 on RedHat 6.4
• SQL Server 2012 Ent Ed. on Windows Core 2012 R2
• PostgreSQL 9.3.3 on RedHat 6.4
HammerDB Workload - Results
Results
Query               PostgreSQL      MemSQL
Opportunity (5K)    5 seconds       200 ms
Sales Order (170K)  1-1.5 minutes   6 seconds
Territory (60K)     60 seconds      5 seconds
PostgreSQL vs In-Memory DB
We picked the top 5 queries run by different business functions.
Presented here are the 3 queries whose response times did not meet the SLA.
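As a worked example, the response times above translate into speedup factors. This is a small sketch using only the figures from the slide; the names RESULTS_MS and speedups are hypothetical, and the 1-1.5 minute result is treated as a low/high range.

```python
# (query, PostgreSQL time in ms as (low, high), MemSQL time in ms)
RESULTS_MS = [
    ("Opportunity (5K)", (5_000, 5_000), 200),
    ("Sales Order (170K)", (60_000, 90_000), 6_000),
    ("Territory (60K)", (60_000, 60_000), 5_000),
]


def speedups(results):
    """Return {query: (low, high)} speedup of MemSQL over PostgreSQL."""
    return {name: (low / mem, high / mem) for name, (low, high), mem in results}


factors = speedups(RESULTS_MS)
for name, (low, high) in factors.items():
    label = f"{low:.0f}x" if low == high else f"{low:.0f}x to {high:.0f}x"
    print(f"{name}: {label}")
```

On these three queries the in-memory database answers roughly 10 to 25 times faster than PostgreSQL.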
21
Business Data Lake – Ingestion to Fulfillment
Raw Data
Summary
Data
DATAGOVERNOR
Consumers
Predictive/
Prescriptive
Analytics
Processed
Data
Analytical Data
GREENPLUM DATABASE
HADOOP
RAW
Data
INGEST
MANAGER
SPRING XD
SPARK
SQOOP
Execution Tier
CASSANDRA GEMFIRE
MEMSQL POSTGRESQL
Real-Time
Tap
22
Here Are The Data Flows We Built
Low Velocity
Batch
Real-Time
23
Data Flow Patterns – Low Velocity
Analytical [BATCH]
Ingestion
Data
Service
JDBC
Application
Presentation [SPEED/SERVING]
GREENPLUM
DATABASE
PIVOTAL HD
POSTGRESQL
MEMSQL
Raw
Data
One-Time
CASSANDRA
GEMFIRE
Analytical [BATCH]
Ingestion
Data
Service
JDBC
Application
GREENPLUM
DATABASE
PIVOTAL HD
24
Data Flow Patterns – Batch
Batch
Presentation [SPEED/SERVING]
POSTGRESQL
MEMSQL CASSANDRA
GEMFIRE
25
Data Flow Patterns – Real Time
Real-time
Initial Load
Analytical [BATCH]
Ingestion
Data
Service
JDBC
Application
GREENPLUM
DATABASE
PIVOTAL HD
Presentation [SPEED/SERVING]
POSTGRESQL
MEMSQL CASSANDRA
GEMFIRE
26
Nothing Closer To Real Time Than Streaming
 Let’s look at the leading edge
 Apache Kafka
 Messaging Semantics
• At most once
• At least once
• Exactly once
27
At most once
28
At least once
29
Exactly Once
30
Understanding Streaming Semantics
At most once
• Message pulled once
• May or may not be received
• No duplicates
• Possible missing data
At least once
• Message pulled one or more times; processed each time
• Receipt guaranteed
• Likely duplicates
• No missing data
Exactly once
• Message pulled one or more times; processed once
• Receipt guaranteed
• No duplicates
• No missing data
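The three semantics can be illustrated with a toy simulation. This is a sketch under stated assumptions, not Kafka or MemSQL code: the Sink class, the consume function and the single injected crash are all hypothetical, and "exactly once" is modelled by storing the offset together with the data so that a replayed message is skipped.

```python
from dataclasses import dataclass, field


@dataclass
class Sink:
    """Toy destination table; rows holds every amount that was written."""
    rows: list = field(default_factory=list)
    last_offset: int = -1  # highest offset stored together with the data


def consume(messages, mode, crash_at):
    """Deliver messages to a Sink, simulating one crash at offset crash_at.

    mode is "at_most_once", "at_least_once" or "exactly_once".
    """
    sink = Sink()
    committed = 0      # next offset the consumer resumes from
    crashed = False
    while committed < len(messages):
        offset = committed
        amount = messages[offset]
        crash_now = offset == crash_at and not crashed
        if mode == "at_most_once":
            committed = offset + 1          # commit first ...
            if crash_now:
                crashed = True              # ... crash before the write: message lost
                continue
            sink.rows.append(amount)
        elif mode == "at_least_once":
            sink.rows.append(amount)        # write first ...
            if crash_now:
                crashed = True              # ... crash before the commit: message replayed
                continue
            committed = offset + 1
        else:  # exactly_once
            if offset > sink.last_offset:   # skip offsets the sink already holds
                sink.rows.append(amount)
                sink.last_offset = offset   # data and offset stored as one atomic step
            if crash_now:
                crashed = True
                continue
            committed = offset + 1
    return sink.rows


messages = [10, -10, 5, -5]  # sums to zero, like the demo data set
for mode in ("at_most_once", "at_least_once", "exactly_once"):
    rows = consume(messages, mode, crash_at=1)
    print(f"{mode:14s} count={len(rows)} sum={sum(rows)}")
```

Only the exactly-once run ends with the original count and a sum of zero; the other two modes lose or duplicate the message that was in flight when the crash happened.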
31
Rendering In Real Time
 Picking the right business intelligence layer
• Tableau
• Custom Application (CF, D3, Docker)
• Additional Third Party Solutions
32
Results, Plans, Lessons Learned
33
Business Benefits
DATA QUERYING: Down from 4 hours per quarter to less than 1 minute per year
SIMPLIFIED PROVISIONING: Reduced number of tables/reports required
DATA GOVERNANCE: Provides one version of the truth
TIME TO MARKET: Reduced number of tables/reports required
TOOL AGNOSTIC: Business logic in the DB, not the tool, provides increased flexibility
34
Use Case: Customer Account Profile
 STREAMLINED ANALYTICS ENVIRONMENT TO GAIN A HOLISTIC CUSTOMER VIEW
Service Request
Contracts
Installed Base
Bookings
Billings
EMC DATA
LAKE
BDL
SERVICES
DATA
WORKSPACES
DATA INGESTION
Prof Services
23 BUSINESS MANAGED WORKSPACES
35
Customer Asset Lifecycle Management
Platform Roadmap
Phase 1 : Foundational
Capabilities/Discovery
Phase 2 : Scale Platform /
Automate
Future Phases : Global Standard tool
Integrations , advanced Analytics
BAaaS/Tableau
Scalable
Platform
Integrated
Platform
GBS
Renewals
Inside
Sales
Additional
Business groups
Aug 2015 | Oct 2015 | 2016 | TBD
BDL Platform
Enablement | Collaboration | Acceleration
In-Memory Capabilities
(POC)
We are here
36
Data Services Roadmap
Security
Planned integration into
custom BDL security API for
managing Role Based Access
Control (RBAC) to the
underlying data
Business Data Lake Plans
37
Lessons Learned – Key Takeaways
EDUCATE
• Educate the business
• Use examples of business impact
ASSESS
• Assess in-house big data skills
• Ensure plan to support the organization for 3-5 years
INFRASTRUCTURE
• Choose the best possible infrastructure
• Make sure your Big Data technology platform can evolve
JOURNEY
• Remember it is a journey
• Look for small wins as well as big wins.
38
Lessons Learned: Analytics and Data
Sourcing the right skills, working with a different philosophy,
and some new tools will help you meet your analytical goals
TRANSFORM YOUR PEOPLE
 Data science in the organization, IT or both?
 Helping business units take initiative
CHANGE YOUR PROCESSES
 New philosophy to running analytics projects
 How and when to share data
ADAPT YOUR TECHNOLOGY
 Steadily refine toolsets based on needed analysis
 Identify infrastructure layers
39
Demonstration
40
Demo Agenda
Showcase exactly-once semantics from Kafka
1: Data set of 200,000 transactions summing to zero
2: CREATE TABLE AND CREATE PIPELINE
3: Push to Kafka and confirm exactly-once
4: Validate Resiliency and confirm exactly-once
Step 1: Data Source
 Start with a data set of 200,000 transactions representing
money/goods that sum to zero
 200,000 transactions
• Transaction number
• Increase / Decrease
• Amount
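A data set with these properties can be sketched as follows. This is not the generator used in the demo: make_transactions, the seed and the amount range are assumptions, and whole-number amounts are used so the sum is exactly zero.

```python
import random


def make_transactions(n=200_000, seed=42):
    """Return n rows of (txn_number, direction, signed_amount) that sum to zero."""
    if n % 2:
        raise ValueError("n must be even so every increase has a matching decrease")
    rng = random.Random(seed)
    signed = []
    for _ in range(n // 2):
        amount = rng.randint(1, 10_000)  # whole units keep the arithmetic exact
        signed += [amount, -amount]      # each increase is paired with a decrease
    rng.shuffle(signed)                  # interleave increases and decreases
    return [
        (number, "increase" if amount > 0 else "decrease", amount)
        for number, amount in enumerate(signed, start=1)
    ]


transactions = make_transactions()
print(len(transactions), sum(amount for _, _, amount in transactions))
```

Because every increase has a matching decrease, any lost or duplicated row shows up as a wrong count or a non-zero sum.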
Step 2: CREATE TABLE AND CREATE PIPELINE
 Create a table and a pipeline in MemSQL that subscribes to
the Kafka topic
CREATE TABLE
CREATE PIPELINE
Step 3: Push to Kafka
 Push that data set to Kafka
 Validate exactly-once delivery by querying MemSQL
• show tables;
• show pipelines;
• select sum(amount) from transactions;
 Should be 0 in the demo
• select count(*) from transactions;
 Should be 200,000 in the demo
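The same two checks can be scripted. The sketch below uses Python's built-in sqlite3 as a stand-in for MemSQL so that it runs anywhere; the validate function, the column names and the four sample rows are assumptions, and only the sum and count queries come from the slide.

```python
import sqlite3


def validate(conn, expected_count, expected_sum=0):
    """Run the sum and count checks and report whether both match."""
    (total,) = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM transactions"
    ).fetchone()
    (count,) = conn.execute("SELECT COUNT(*) FROM transactions").fetchone()
    return {
        "count": count,
        "sum": total,
        "count_ok": count == expected_count,
        "sum_ok": total == expected_sum,
    }


# Stand-in table with a tiny data set that sums to zero.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    "txn_number INTEGER PRIMARY KEY, direction TEXT, amount INTEGER)"
)
rows = [(1, "increase", 250), (2, "decrease", -250),
        (3, "increase", 40), (4, "decrease", -40)]
conn.executemany("INSERT INTO transactions VALUES (?, ?, ?)", rows)

result = validate(conn, expected_count=len(rows))
print(result)
```

In the demo the expected count would be 200,000; a duplicated or dropped message changes the count, and in most cases the sum as well.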
46
Step 4: Resiliency
 Induce failures to show resiliency during exactly-once
workflows
a. randomly_fail_batches.py
b. restart Kafka and show error count
c. continue and validate exactly-once semantics
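The idea behind the resiliency test can be sketched without Kafka. The code below is not the randomly_fail_batches.py script from the demo, whose contents are not shown in the slides. It is a hypothetical illustration that uses sqlite3 as a stand-in database and commits each batch together with its offset in one transaction, so a failed batch rolls back fully and its retry cannot create duplicates.

```python
import random
import sqlite3


def load_with_random_failures(conn, rows, batch_size=1000, fail_rate=0.3, seed=7):
    """Load rows in batches, injecting random failures; return the error count."""
    rng = random.Random(seed)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS transactions (txn_number INTEGER, amount INTEGER)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS offsets ("
        "id INTEGER PRIMARY KEY CHECK (id = 0), next_row INTEGER)"
    )
    conn.execute("INSERT OR IGNORE INTO offsets VALUES (0, 0)")
    conn.commit()
    errors = 0
    while True:
        (start,) = conn.execute("SELECT next_row FROM offsets").fetchone()
        if start >= len(rows):
            return errors
        batch = rows[start:start + batch_size]
        try:
            with conn:  # commits on success, rolls back on an exception
                conn.executemany("INSERT INTO transactions VALUES (?, ?)", batch)
                if rng.random() < fail_rate:
                    raise RuntimeError("injected batch failure")
                conn.execute(
                    "UPDATE offsets SET next_row = ?", (start + len(batch),)
                )
        except RuntimeError:
            errors += 1  # the batch is retried from the stored offset


conn = sqlite3.connect(":memory:")
rows = [(i, 5 if i % 2 else -5) for i in range(1, 10_001)]  # sums to zero
errors = load_with_random_failures(conn, rows)
count, total = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM transactions"
).fetchone()
print(f"errors={errors} count={count} sum={total}")
```

However many batches fail, the final count and sum match the source data, which is the property the demo validates after restarting Kafka.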
48
Errors
Total
Transactions
Sum
The mission is clear:
We’re moving
from batch to real-time
with streaming
Thank You
Darryl Smith
Chief Data Platform Architect and Distinguished Engineer
Dell Technologies
