The Future of
Real-Time in Spark
Reynold Xin @rxin
Spark Summit, New York, Feb 18, 2016
Why Real-Time?
Making decisions faster is valuable.
• Preventing credit card fraud
• Monitoring industrial machinery
• Human-facing dashboards
• …
Streaming Engine
noun.
Takes an input stream and produces an output stream.
Spark Unified Stack
[Stack diagram: SQL, Streaming, MLlib, and GraphX layered on top of Spark Core]
Streaming (highlighted in the stack):
• Introduced 3 years ago in Spark 0.7
• 50% of users consider it the most important part of Spark
Spark Streaming
• First attempt at unifying streaming and batch
• State management built in
• Exactly-once semantics
• Features required for large clusters:
  straggler mitigation, dynamic load balancing, fast fault recovery
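For context, a minimal sketch of Spark Streaming's built-in state management using the DStream API's updateStateByKey; the socket source and counting logic here are illustrative, not from the talk:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StatefulCounts")
ssc = StreamingContext(sc, 1)          # 1-second micro-batches
ssc.checkpoint("/tmp/checkpoints")     # required for stateful operators

# Illustrative source: page names arriving over a socket
pages = ssc.socketTextStream("localhost", 9999).map(lambda p: (p, 1))

# Built-in state management: a running count per page across batches
def update(new_values, running):
    return sum(new_values) + (running or 0)

pages.updateStateByKey(update).pprint()

ssc.start()
ssc.awaitTermination()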
Streaming computations don’t run in isolation.
Use Case: Fraud Detection
[Diagram: an event stream feeds the system; anomalies are flagged in the stream]
• A machine learning model continuously updates to detect new anomalies
• Ad-hoc analysis of historic data
Continuous Application
noun.
An end-to-end application that acts on real-time data.
Challenges Building Continuous Applications
Integration with non-streaming systems is often an afterthought
• Interactive, batch, relational databases, machine learning, …
Streaming programming models are complex
Integration Example
[Diagram: a stream of page-view events flows through a streaming engine into MySQL]
Stream:
(home.html, 10:08)
(product.html, 10:09)
(home.html, 10:10)
. . .
MySQL:
Page    | Minute | Visits
home    | 10:09  | 21
pricing | 10:10  | 30
...     | ...    | ...
What can go wrong?
• Late events
• Partial outputs to MySQL
• State recovery on failure
• Distributed reads/writes
• ...
Complex Programming Models
Data: late arrival, varying distribution over time, …
Processing: business logic changes & new operations (windows, sessions)
Output: how do we define output over time & correctness?
Structured Streaming
The simplest way to perform streaming analytics
is not having to reason about streaming.
Spark 1.3: Static DataFrames → Spark 2.0: Infinite DataFrames
Single API!
Structured Streaming
High-level streaming API built on the Spark SQL engine
• Runs the same queries on DataFrames
• Event time, windowing, sessions, sources & sinks
Unifies streaming, interactive, and batch queries
• Aggregate data in a stream, then serve using JDBC
• Change queries at runtime
• Build and apply ML models
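A minimal sketch of the kind of event-time windowed query this enables, written in the proposed DataFrame API used later in this deck; ctx, stream(), and the window() helper are illustrative assumptions, not final names:

events = ctx.read.format("json").stream("s3://events")

# Count page views per 1-minute event-time window; a late event lands
# in the window its evtime belongs to, not its arrival time
counts = events.groupBy(
    events.page,
    window(events.evtime, "1 minute")
).count()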
Model
[Diagram: trigger every 1 sec; at each trigger point t = 1, 2, 3 the input is all data up to t, the query computes the result over that input, and the complete output is emitted]
Model
[Same diagram with delta output: at each trigger, only the output for data that arrived since the previous trigger is emitted]
Model Details
Input sources: append-only tables
Queries: new operators for windowing, sessions, etc.
Triggers: based on time (e.g. every 1 sec)
Output modes: complete, deltas, update-in-place
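A toy illustration in plain Python (not Spark) of how the complete and delta output modes differ for a running count, assuming a trigger at t = 1, 2, 3 and a made-up batch of page views per trigger:

batches = {1: ["home"], 2: ["home", "pricing"], 3: ["home"]}
counts = {}
for t in sorted(batches):
    delta = {}
    for page in batches[t]:
        counts[page] = counts.get(page, 0) + 1
        delta[page] = counts[page]
    # complete mode emits the whole result; delta mode only changed rows
    print(f"t={t} complete: {counts} delta: {delta}")

Update-in-place goes one step further: rather than emitting deltas downstream, it rewrites the changed rows in an external store such as MySQL.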
Example: ETL
Input: files in S3
Query: map (transform each record)
Trigger: "every 5 sec"
Output mode: "new records", into S3 sink
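A minimal sketch of this ETL pipeline in the deck's proposed API; ctx, stream(), trigger(), and outputMode() are illustrative names, and the field transformation is a stand-in:

logs = ctx.read.format("json").stream("s3://logs")

# map: transform each record (here, select and cast two fields)
cleaned = logs.select(logs.user_id, logs.time.cast("timestamp"))

cleaned.write.format("json") \
    .trigger("5 seconds") \
    .outputMode("new records") \
    .stream("s3://cleaned-logs")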
Example: Page View Count
Input: records in Kafka
Query: select count(*) group by page, minute(evtime)
Trigger: "every 5 sec"
Output mode: "update-in-place", into MySQL sink
Note: this will automatically update "old" records on late data!
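The same query as a sketch in the deck's proposed API; method names are illustrative, and minute() is assumed to extract the minute from the event time:

views = ctx.read.format("kafka").stream("kafka://...")

counts = views.groupBy(views.page, minute(views.evtime)).count()

# update-in-place: late events rewrite previously written MySQL rows
counts.write.format("jdbc") \
    .trigger("5 seconds") \
    .outputMode("update-in-place") \
    .stream("jdbc:mysql://...")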
Logically:
DataFrame operations on static data
(i.e. as easy to understand as batch)
Physically:
Spark automatically runs the query in streaming fashion
(i.e. incrementally and continuously)
[Diagram: DataFrame → Logical Plan → Catalyst optimizer → continuous, incremental execution]
Example: Batch Aggregation
logs = ctx.read.format("json").open("s3://logs")
logs.groupBy(logs.user_id).agg(sum(logs.time))
    .write.format("jdbc")
    .save("jdbc:mysql://...")
Example: Continuous Aggregation
logs = ctx.read.format("json").stream("s3://logs")
logs.groupBy(logs.user_id).agg(sum(logs.time))
    .write.format("jdbc")
    .stream("jdbc:mysql://...")
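Note that the only change from the batch version is swapping open/save for stream: the same DataFrame query now runs incrementally and continuously.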
Automatic Incremental Execution
[Diagram: the aggregate is recomputed incrementally at T = 0, T = 1, T = 2, …]
Rest of Spark will follow
• Interactive queries should just work
• Spark's data source API will be updated to support seamless streaming integration
• Exactly-once semantics end-to-end
• Different output modes (complete, delta, update-in-place)
• ML algorithms will be updated too
What can we do with this that's hard with other engines?
• Ad-hoc, interactive queries
• Dynamically changing queries
• Benefits of Spark: elastic scaling, straggler mitigation, etc.
Use Case: Fraud Detection
[Diagram: an event stream feeds the system; anomalies are flagged in the stream]
• A machine learning model continuously updates to detect new anomalies
• Analyze historic data
Timeline
Spark 2.0
• API foundation
• Kafka, file systems, and databases
• Event-time aggregations
Spark 2.1+
• Continuous SQL
• BI app integration
• Other streaming sources / sinks
• Machine learning
Thank you.
@rxin
