The Future of
Real-Time in Spark
Reynold Xin @rxin
Spark Summit, New York, Feb 18, 2016
Why Real-Time?
Making decisions faster is valuable.
• Preventing credit card fraud
• Monitoring industrial machinery
• Human-facing dashboards
• …
Streaming Engine
noun.
Takes an input stream and produces an output stream.
Spark Unified Stack
[Stack diagram: SQL, Streaming, MLlib, and GraphX layered on top of Spark Core]
Streaming (highlighted in the stack):
• Introduced 3 years ago in Spark 0.7
• 50% of users consider it the most important part of Spark
Spark Streaming
• First attempt at unifying streaming and batch
• State management built in
• Exactly-once semantics
• Features required for large clusters:
  straggler mitigation, dynamic load balancing, fast fault recovery
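For context, a minimal sketch of Spark Streaming's built-in state management using the DStream API's updateStateByKey; the socket source and counting logic here are illustrative, not from the talk:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "StatefulCounts")
ssc = StreamingContext(sc, 1)          # 1-second micro-batches
ssc.checkpoint("/tmp/checkpoints")     # required for stateful operators

# Illustrative source: page names arriving over a socket
pages = ssc.socketTextStream("localhost", 9999).map(lambda p: (p, 1))

# Built-in state management: a running count per page across batches
def update(new_values, running):
    return sum(new_values) + (running or 0)

pages.updateStateByKey(update).pprint()

ssc.start()
ssc.awaitTermination()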
Streaming computations don’t run in isolation.
Use Case: Fraud Detection
[Diagram: an event stream feeds the system; anomalies are flagged in the stream]
• A machine learning model continuously updates to detect new anomalies
• Ad-hoc analysis of historic data
Continuous Application
noun.
An end-to-end application that acts on real-time data.
Challenges Building Continuous Applications
Integration with non-streaming systems is often an afterthought
• Interactive, batch, relational databases, machine learning, …
Streaming programming models are complex
Integration Example
[Diagram: a stream of page-view events flows through a streaming engine into MySQL]
Stream:
(home.html, 10:08)
(product.html, 10:09)
(home.html, 10:10)
. . .
MySQL:
Page    | Minute | Visits
home    | 10:09  | 21
pricing | 10:10  | 30
...     | ...    | ...
What can go wrong?
• Late events
• Partial outputs to MySQL
• State recovery on failure
• Distributed reads/writes
• ...
Complex Programming Models
Data: late arrival, varying distribution over time, …
Processing: business logic changes & new operations (windows, sessions)
Output: how do we define output over time & correctness?
Structured Streaming
The simplest way to perform streaming analytics
is not having to reason about streaming.
Spark 1.3: Static DataFrames → Spark 2.0: Infinite DataFrames
Single API!
Structured Streaming
High-level streaming API built on the Spark SQL engine
• Runs the same queries on DataFrames
• Event time, windowing, sessions, sources & sinks
Unifies streaming, interactive, and batch queries
• Aggregate data in a stream, then serve using JDBC
• Change queries at runtime
• Build and apply ML models
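A minimal sketch of the kind of event-time windowed query this enables, written in the proposed DataFrame API used later in this deck; ctx, stream(), and the window() helper are illustrative assumptions, not final names:

events = ctx.read.format("json").stream("s3://events")

# Count page views per 1-minute event-time window; a late event lands
# in the window its evtime belongs to, not its arrival time
counts = events.groupBy(
    events.page,
    window(events.evtime, "1 minute")
).count()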
Model
[Diagram: trigger every 1 sec; at each trigger point t = 1, 2, 3 the input is all data up to t, the query computes the result over that input, and the complete output is emitted]
Model
[Same diagram with delta output: at each trigger, only the output for data that arrived since the previous trigger is emitted]
Model Details
Input sources: append-only tables
Queries: new operators for windowing, sessions, etc.
Triggers: based on time (e.g. every 1 sec)
Output modes: complete, deltas, update-in-place
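A toy illustration in plain Python (not Spark) of how the complete and delta output modes differ for a running count, assuming a trigger at t = 1, 2, 3 and a made-up batch of page views per trigger:

batches = {1: ["home"], 2: ["home", "pricing"], 3: ["home"]}
counts = {}
for t in sorted(batches):
    delta = {}
    for page in batches[t]:
        counts[page] = counts.get(page, 0) + 1
        delta[page] = counts[page]
    # complete mode emits the whole result; delta mode only changed rows
    print(f"t={t} complete: {counts} delta: {delta}")

Update-in-place goes one step further: rather than emitting deltas downstream, it rewrites the changed rows in an external store such as MySQL.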
Example: ETL
Input: files in S3
Query: map (transform each record)
Trigger: "every 5 sec"
Output mode: "new records", into S3 sink
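A minimal sketch of this ETL pipeline in the deck's proposed API; ctx, stream(), trigger(), and outputMode() are illustrative names, and the field transformation is a stand-in:

logs = ctx.read.format("json").stream("s3://logs")

# map: transform each record (here, select and cast two fields)
cleaned = logs.select(logs.user_id, logs.time.cast("timestamp"))

cleaned.write.format("json") \
    .trigger("5 seconds") \
    .outputMode("new records") \
    .stream("s3://cleaned-logs")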
Example: Page View Count
Input: records in Kafka
Query: select count(*) group by page, minute(evtime)
Trigger: "every 5 sec"
Output mode: "update-in-place", into MySQL sink
Note: this will automatically update "old" records on late data!
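The same query as a sketch in the deck's proposed API; method names are illustrative, and minute() is assumed to extract the minute from the event time:

views = ctx.read.format("kafka").stream("kafka://...")

counts = views.groupBy(views.page, minute(views.evtime)).count()

# update-in-place: late events rewrite previously written MySQL rows
counts.write.format("jdbc") \
    .trigger("5 seconds") \
    .outputMode("update-in-place") \
    .stream("jdbc:mysql://...")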
Logically:
DataFrame operations on static data
(i.e. as easy to understand as batch)
Physically:
Spark automatically runs the query in streaming fashion
(i.e. incrementally and continuously)
[Diagram: DataFrame → Logical Plan → Catalyst optimizer → continuous, incremental execution]
Example: Batch Aggregation
logs = ctx.read.format("json").open("s3://logs")
logs.groupBy(logs.user_id).agg(sum(logs.time))
    .write.format("jdbc")
    .save("jdbc:mysql://...")
Example: Continuous Aggregation
logs = ctx.read.format("json").stream("s3://logs")
logs.groupBy(logs.user_id).agg(sum(logs.time))
    .write.format("jdbc")
    .stream("jdbc:mysql://...")
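Note that the only change from the batch version is swapping open/save for stream: the same DataFrame query now runs incrementally and continuously.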
Automatic Incremental Execution
[Diagram: the aggregate is recomputed incrementally at T = 0, T = 1, T = 2, …]
Rest of Spark will follow
• Interactive queries should just work
• Spark's data source API will be updated to support seamless streaming integration
• Exactly-once semantics end-to-end
• Different output modes (complete, delta, update-in-place)
• ML algorithms will be updated too
What can we do with this that's hard with other engines?
• Ad-hoc, interactive queries
• Dynamically changing queries
• Benefits of Spark: elastic scaling, straggler mitigation, etc.
Use Case: Fraud Detection
[Diagram: an event stream feeds the system; anomalies are flagged in the stream]
• A machine learning model continuously updates to detect new anomalies
• Analyze historic data
Timeline
Spark 2.0
• API foundation
• Kafka, file systems, and databases
• Event-time aggregations
Spark 2.1+
• Continuous SQL
• BI app integration
• Other streaming sources / sinks
• Machine learning
Thank you.
@rxin
