NoLambda: Combining Streaming, Ad-Hoc, Machine Learning and Batch Analysis

NoLambda: Combining Streaming,
Ad-Hoc, Machine Learning, and
Batch Analytics

Evan Chan and Helena Edelson
March 2016

Evan Chan
Distinguished Engineer, Tuplejump
@evanfchan
http://velvia.github.io
User and contributor to Spark since 0.9, Cassandra since 0.6
Co-creator and maintainer of Spark Job Server

Helena Edelson
@helenaedelson | github.com/helena
VP of Product Engineering, Tuplejump
Cloud Engineer, Big Data, Event-Driven systems
Committer: Kafka Connect Cassandra, Spark Cassandra Connector
Contributor: Akka, Spring Integration
Speaker: Kafka Summit, Spark Summit, Strata, QCon, Scala Days, Scala World, Philly ETE

Tuplejump
Tuplejump is a big data technology leader providing solutions and development partnership.

Open Source: Tuplejump on GitHub
FiloDB - Distributed Spark + Cassandra analytics database
Kafka Connect Cassandra - Kafka-Cassandra Source and Sink
Calliope - The first Spark Cassandra integration
Stargate - Lucene indexer for Cassandra
SnackFS - HDFS for Cassandra

Tuplejump Consulting & Development

Topics
Modern streaming and batch/ad-hoc architectures
Pitfalls and Simplification
Precise and scalable streaming ingestion
FiloDB and fast analytics with competitive storage
cost
Machine learning with Spark, Cassandra, and FiloDB

Delivering Meaning
Derived From
Many data streams
Disparate sources and schemas
Originating from many
locations

The Problem Domain
Build scalable, adaptable, self-healing, distributed data
processing systems for
24 / 7 Uptime
Auto scale out
Complex analytics and learning tasks
Aggregate global data
Ops for global, multi-dc clustered data flows
Decoupled services
Idempotent & Acceptable Consistency
No data loss

Factors & Constraints in Architecture
Corner cases, Industry use cases
Legal constraints - user data crossing
borders
% Writes on ingestion, % Reads
How much in memory?
Counters? Geo-locational?
JDK version for cloud deployments
Dimensions of data in queries
What needs real time feedback loops
The list goes on...

Need Self-Healing Systems
Massive event spikes & bursty traffic
Fast producers / slow consumers
Network partitioning & out of sync
systems
DC down
Not DDOS'ing ourselves from fast streams
No data loss when auto-scaling down
Monitor Everything
Everything fails, all the time

Use Case
I need fast access to historical data on the fly for predictive
modeling with real time data from the stream

Only, It's Not A Stream It's A Flood
Trillions of event writes per day
Billions of event reads per day
Massive events per second at
peak
Petabytes of total streaming data

Not All Streams Are Created The Same
Daily, Hourly, Frequency, Event Spikes at Peak, Overall Volume
Sub-second, low latency stream
processing
Higher latency stream processing
Scheduled or on request batch processing

Real Time
Just means Event Driven or processing events as they arrive
Doesn't automatically equal sub-second latency
requirements
Event Time
When an event is created, e.g. on sensor
Events should be uniquely timestamped on ingestion for
tracking, metrics and replay

Based on the schema of data in a given stream
Some can aggregate with sliding windows (T1...Tn, Tn+1...) using
window length + slide interval:
stream.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
Some must aggregate by buckets
/Event_Type/YYYY/MM/DD/HH/MM/...
CREATE TABLE timeseries.raw_data_fu (
  sensor_id text, year int, month int, day int, hour int, ...
  PRIMARY KEY ((sensor_id), year, month, day, hour)
) WITH CLUSTERING ORDER BY (year DESC, month DESC, day DESC, hour DESC);
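The two aggregation styles above can be sketched in plain Scala, without Spark. This is only an illustration: the `Event` case class, the `WindowingSketch` object and its method names are hypothetical, and the window semantics are simplified to half-open intervals [start, start + length) rather than Spark Streaming's batch-aligned windows.

```scala
import java.time.{Instant, ZoneOffset}

object WindowingSketch {
  // Hypothetical event shape: epoch seconds, a key, and a value to sum.
  final case class Event(epochSec: Long, key: String, value: Int)

  // Sum values per key for each window [start, start + lengthSec),
  // with window starts advancing by slideSec from startSec up to (not including) endSec.
  def slidingSums(events: Seq[Event], startSec: Long, endSec: Long,
                  lengthSec: Long, slideSec: Long): Seq[(Long, Map[String, Int])] =
    (startSec until endSec by slideSec).map { start =>
      val inWindow = events.filter(e => e.epochSec >= start && e.epochSec < start + lengthSec)
      start -> inWindow.groupBy(_.key).map { case (k, es) => k -> es.map(_.value).sum }
    }

  // Bucket an event time into (year, month, day, hour) in UTC,
  // mirroring the clustering columns of the CQL table above.
  def bucket(epochSec: Long): (Int, Int, Int, Int) = {
    val t = Instant.ofEpochSecond(epochSec).atZone(ZoneOffset.UTC)
    (t.getYear, t.getMonthValue, t.getDayOfMonth, t.getHour)
  }
}
```

With a 30 second window and a 10 second slide, each event contributes to up to three overlapping windows, whereas `bucket` assigns every event to exactly one hourly bucket.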

Stream Processing
Kafka - Foundation to streaming
architecture
Samza - Just streaming
Gearpump - Real-time big data streaming
Analytics Stream Processing
Storm - Real-time Analytics, ML, needs Trident to
stream
Flink - Real-time Analytics, ML, Graph
Spark Streaming - Micro-batch Analytics, ML, Graph

Legacy Infrastructure
Highly-invested-in existing architecture around
Hadoop
Existing analytics logic for scheduled MR jobs

Lambda Architecture
A data-processing architecture designed to handle massive quantities
of data by taking advantage of both batch and stream processing
methods.

Lambda Architecture
(https://www.mapr.com/developercentral/lambda-architecture)

λ The Good
Immutability - retaining master data
With timestamped events
Appended versus overwritten
events
Attempt to beat CAP
Pre-computed views for
further processing
faster ad-hoc querying

λ The Bad
Two Analytics systems to support
Operational complexity
By the time a scheduled job is run 90% of the data is stale
Many moving parts: KV store, real time platform, batch
technologies
Running similar code and reconciling queries in dual systems
Analytics logic changes on dual systems

λ The Overly Complicated
Immutable sequence of records is ingested and fed into
a batch processing system
and a stream processing
system
in parallel
Ultimately Very High TCO And...

Are Batch and Streaming Systems
Fundamentally Different?
Both accumulate events from T1...Tn, Tn+1...Tn+x, ...
Or bucketed by Year-Month-Day-Hour?
Streaming frameworks have schedulers

A Unified Streaming Architecture
Everything On The Streaming Platform
Scala / Spark Streaming
Mesos
Akka
Cassandra
Kafka

High Throughput Distributed Messaging
High Scalability - billions of events per
day
Durability - no data loss
Immutability
Support Massive Number of Consumers
Very efficient and low latency
Decouples Data Pipelines
Automatic recovery from broker failures

Stream Processing Simplified
Kafka Streams
In master, coming in v0.10
Removes the need to run another framework like Storm
alongside Kafka
Removes the need for separate infrastructures
Common stream operations, e.g. join, filter, map, etc.
Windowing
Proper time modeling, e.g. event time vs. processing time
Local state management with persistence and replication
Schema and Avro support

Spark Streaming
Iterative ML, Interactive Querying, Graph, DataFrames
One runtime for streaming and batch processing
Join streaming and static data sets
No code duplication
Easy Kafka stream integration
Easy to reconcile queries against multiple
sources
Easy integration of KV durable storage

Apache Cassandra
Horizontally scalable
Multi-Region / Multi-Datacenter
Always On - Survive regional outages
Extremely fast writes - perfect for ingestion of real time /
machine data
Very flexible data modelling (lists, sets, custom data types)
Easy to operate
Best of breed storage technology, huge community
BUT: Simple queries only
OLTP-oriented/center

High performance concurrency framework for Scala and
Java
Fault Tolerance
Asynchronous messaging and data processing
Parallelization
Location Transparency
Local / Remote Routing
Akka: Cluster / Persistence / Streams

Enables
Streaming and Batch In One System
Streaming ML and Analytics for Predictions In The Stream

Kafka Streams
// Sketch against the pre-release (v0.10) Kafka Streams API
val builder = new KStreamBuilder()
val stream: KStream[K, V] = builder.stream(des, des, "raw.data.topic")
  .flatMapValues(value => Arrays.asList(value.toLowerCase.split(" ")))
  .map((k, v) => new KeyValue(k, v))
  .countByKey(ser, ser, des, des, "kTable")
  .toStream()
stream.to("results.topic", ...)
val streams = new KafkaStreams(builder, props)
streams.start()

https://github.com/confluentinc/demos

Spark Streaming Kafka
Immutable Raw Data From Kafka Stream
Replaying data streams: for fault tolerance, logic changes..
class KafkaStreamingActor(ssc: StreamingContext) extends MyAggregationActor {
  val stream = KafkaUtils.createDirectStream(...).map(RawWeatherData(_))
  stream
    .foreachRDD(_.toDF.write.format("filodb.spark")
      .option("dataset", "rawdata").save())
  /* Pre-aggregate data in the stream for fast querying and aggregation later. */
  stream.map(hour =>
    (hour.wsid, hour.year, hour.month, hour.day, hour.oneHourPrecip)
  ).saveToCassandra(timeseriesKeyspace, dailyPrecipTable)
}

Reading Data From Cassandra On Request, Further
Aggregation
Compute isolation in Akka Actor
class TemperatureActor(sc: SparkContext) extends AggregationActor {
  import akka.pattern.pipe
  def receive: Actor.Receive = {
    case e: GetMonthlyHiLowTemperature => highLow(e, sender)
  }
  def highLow(e: GetMonthlyHiLowTemperature, requester: ActorRef): Unit =
    sc.cassandraTable[DailyTemperature](timeseriesKeyspace, dailyTempAggregTable)
      .where("wsid = ? AND year = ? AND month = ?", e.wsid, e.year, e.month)
      .collectAsync()
      .map(MonthlyTemperature(_, e.wsid, e.year, e.month)) pipeTo requester
}

Spark Streaming, MLLib
Kafka, Cassandra
val ssc = new StreamingContext(sparkConf, Seconds(5))
val testData = ssc.cassandraTable[String](keyspace, table)
  .map(LabeledPoint.parse)
val trainingStream = KafkaUtils.createDirectStream[_, _, _, _](..)
  .map(transformFunc)
trainingStream.saveToCassandra("ml_training_keyspace", "raw_training_data")
// trainOn returns Unit, so keep a reference to the model before training
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.dense(weights))
model.trainOn(trainingStream)
model
  .predictOnValues(testData.map(lp => (lp.label, lp.features)))
  .saveToCassandra("ml_predictions_keyspace", "predictions")

What's Missing? One Pipeline For Fast +
Big Data

Using Cassandra for Batch Analytics /
Event Storage / ML?
Storage efficiency and scan speeds for reading large volumes
of data (for complex analytics, ML) become important
concerns
Regular Cassandra CQL tables are not very good at either
storage efficiency or scan speeds
A different, analytics-optimized solution is needed...

All hard work leads to profit, but mere talk leads
to poverty.
- Proverbs 14:23

Introducing FiloDB
A distributed, versioned, columnar analytics database.
Built for Streaming.

github.com/tuplejump/FiloDB

Fast Analytics Storage
Scan speeds competitive with Apache Parquet
Up to 200x faster scan speeds than with Cassandra 2.x
Flexible filtering along two dimensions
Much more efficient and flexible partition key filtering
Efficient columnar storage, up to 40x more efficient than
Cassandra 2.x

Comparing Storage Costs and Query Speeds
https://www.oreilly.com/ideas/apache-cassandra-for-analytics-
a-performance-and-storage-analysis

Robust Distributed Storage
Apache Cassandra as the rock-solid storage engine. Scale out
with no SPOF. Cross-datacenter replication. Proven storage and
database technology.

Cassandra-Like Data Model
                  Column A                Column B
Partition key 1   Segment 1 | Segment 2   Segment 1 | Segment 2
Partition key 2   Segment 1 | Segment 2   Segment 1 | Segment 2
partition keys - distributes data around a cluster, and allows
for fine grained and flexible filtering
segment keys - do range scans within a partition, e.g. by time
slice
primary key based ingestion and updates

Flexible Filtering
Unlike Cassandra, FiloDB offers very flexible and efficient
filtering on partition keys. Partial key matches, fast IN queries on
any part of the partition key.
No need to write multiple tables to work around answering different
queries.

Spark SQL Queries!
CREATE TABLE gdelt USING filodb.spark OPTIONS (dataset "gdelt");
SELECT Actor1Name, Actor2Name, AvgTone FROM gdelt ORDER BY AvgTone DESC LIMIT 15;
INSERT INTO gdelt SELECT * FROM NewMonthData;
Read from and write to Spark DataFrames
Append/merge to FiloDB table from Spark Streaming
Use Tableau or any other JDBC tool

What's in the name?
Rich sweet layers of distributed, versioned database goodness

SNACK (SMACK) stack for all your Analytics
Regular Cassandra tables for highly concurrent, aggregate /
key-value lookups (dashboards)
FiloDB + C* + Spark for efficient long term event storage
Ad hoc / SQL / BI
Data source for MLLib / building models
Data storage for classified / predicted / scored data

Being Productionized as we speak...
One enterprise with many TB of financial and reporting data is
moving their data warehouse to FiloDB + Cassandra + Spark
Another startup uses FiloDB as event storage, feeds the events
into Spark MLlib, scores incoming data, then stores the results
back in FiloDB for low-latency use cases
From their CTO: “I see close to MemSQL / Vertica or even
better” “More cost effective than Redshift”

FiloDB Use Cases
Data Warehousing / BI
< 10 second SLA, nontrivial reports, some concurrency
need to store and query lots of data efficiently
Time series
idempotent write API, simultaneous write and read
workloads
In-memory SQL web server
700 queries per second using in-memory column store

FiloDB vs HDFS/Parquet
               FiloDB                                      Parquet
Ingestion      Idempotent primary-key based; appends and   File-based append API only
               replaces; deletes coming
Filtering      Partition-key and segment-key filtering     Mostly file-based
Scan speeds    Parquet-like                                Good for OLAP
Storage cost   Within 35% of Parquet

FiloDB vs HDFS/Parquet
In practice, with good data modeling, FiloDB is a far better fit for
low-latency / concurrent BI / reporting / dashboard applications.

FiloDB vs Druid
Different use cases:
Druid is optimized mostly for OLAP cube / slice and dice
analysis. Append only, keeps only aggregates, not a raw event
store.
FiloDB stores raw data - can be used to build ML models,
visualize and analyze raw time series data, do complex event
flow analysis - much more flexible
FiloDB can update/replace data
FiloDB does not require data denormalization - can handle
traditional BI star schemas with slowly changing dimension
tables

Come check out the demo!
Visit FiloDB at the Developer Showcase (Expo hall) today!

Machine Learning with Spark, Cassandra,
and FiloDB

Building a static model of NYC Taxi Trips
Predict time to get to destination based on pickup point, time
of day, other vars
Need to read all data (full table scan)

Dynamic models are better than static
models
Everything changes!
Continuously refine model based on recent streaming data +
historical data + existing model

val ssc = new StreamingContext(sparkConf, Seconds(5))
val dataStream = KafkaUtils.createDirectStream[..](..)
  .map(transformFunc)
dataStream.foreachRDD(_.toDF.write.format("filodb.spark")
  .option("dataset", "training").save())
// Declare the model outside the if block so it is in scope for prediction;
// trainOn returns Unit, so it cannot be chained into the assignment.
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.dense(weights))
if (trainNow) {
  model.trainOn(dataStream.join(historicalEvents))
}
model.predictOnValues(dataStream.map(lp => (lp.label, lp.features)))
  .insertIntoFilo("predictions")

The FiloDB Advantage for ML
Able to update dynamic models based on massive data
flow/updates
Integrate historical and recent events to build models
More data -> better models!
Can store scored raw data / predictions back in FiloDB
for fast user queries

FiloDB - Roadmap
Your input is appreciated!
Productionization and automated stress testing
Kafka input API / connector (without needing Spark)
In-memory caching for significant query speedup
True columnar querying and execution, using late
materialization and vectorization techniques. GPU/SIMD.
Projections. Often-repeated queries can be sped up
significantly with projections.

Thanks For Attending!
@helenaedelson
@evanfchan
@tuplejump

What are my storage needs?
Non-persistent / in-memory: concurrent
viewers
Short term: latest trends
Longer term: raw event and aggregate storage
ML Models, predictions, scored data

Spark RDDs
Immutable, cache in memory and/or on
disk
Spark Streaming: UpdateStateByKey
IndexedRDD - can update bits of data
Snapshotting for recovery

Using Cassandra for Short Term Storage
1020s 1010s 1000s
Bus A Speed, GPS
Bus B
Bus C
Primary key = (Bus UUID, timestamp)
Easy queries: location and speed of single bus for a range of
time
Can also query most recent location + speed of all buses
(slower)

Scenarios
BI Reporting, concurrency + seconds latency
Ad-hoc queries
Needing to do JOINs with fact tables + dimension
tables
Slowly changing dim tables / hard to denormalize
Need to work with legacy BI tools

Real-world DW Architecture Stack
Efficient columnar storage + filtering = low latency BI

Modeling Fact Tables for FiloDB
Single partition queries are really fast and take up only one
thread
Given the following two partition key columns:
entity_number, year_month
WHERE entity_number = '0453' AND
year_month = '2014 December'
Exact match for partition key is pushed down as one
partition
Consider the partition key carefully

Cassandra often requires multiple tables
What about the queries that do not translate to one partition?
Cassandra has many restrictions on partition key filtering (as of
2.x).
Table 1: partition key = (entity_number, year_month)
Can push down: WHERE entity_number = NN AND
year_month IN ('2014 Jan', '2014 Feb') as well as equals
Table 2: partition key = (year_month, entity_number)
Can push down: WHERE year_month = YYMM AND
entity_number IN (123, 456) as well as equals
IN clause must be the last column to be pushed down. Two tables
are needed just for efficient IN queries on either entity_number
or year_month.

FiloDB Flexible Partition Filters = WIN
With ONE table, FiloDB offers FAST, arbitrary partition key
filtering. All of the below are pushed down:
WHERE year_month IN ('2014 Jan', '2014 Feb') (all entities)
WHERE entity_number = 146 (all year months)
Any combo of =, IN
Space savings: 27 * 2 = 54x
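To make "any combo of =, IN" concrete, here is a minimal plain-Scala sketch of per-column partition-key filtering over a two-column key (entity_number, year_month). It only illustrates the filter semantics and is not FiloDB's implementation; the `ColFilter` types and `select` function are hypothetical names.

```scala
object PartitionFilterSketch {
  // A predicate on one partition-key column: no constraint, equality, or IN.
  sealed trait ColFilter
  case object AnyValue extends ColFilter
  final case class Eq(value: String) extends ColFilter
  final case class In(values: Set[String]) extends ColFilter

  def matches(f: ColFilter, v: String): Boolean = f match {
    case AnyValue => true
    case Eq(x)    => x == v
    case In(xs)   => xs.contains(v)
  }

  // Keep the partitions whose (entity_number, year_month) key satisfies both column filters.
  def select(partitions: Seq[(String, String)],
             entity: ColFilter, yearMonth: ColFilter): Seq[(String, String)] =
    partitions.filter { case (e, ym) => matches(entity, e) && matches(yearMonth, ym) }
}
```

Because each key column is filtered independently, one table can answer `year_month IN (...)` across all entities as well as `entity_number = ...` across all months, which is the case that needs two differently ordered tables in Cassandra 2.x.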

Multi-Table JOINs with just Cassandra

Sub-second Multi-Table JOINs with FiloDB
Four tables, all of them single-partition queries
Two tables were switched from regular Cassandra tables to
FiloDB tables. 40-60 columns each, ~60k items in partition.
Scan times went down from 5-6 seconds to < 250ms
For more details, please see this Planet Cassandra blog post.

Scalable Time-Series / Event Storage with
FiloDB

Designed for Streaming
New rows appended via Spark Streaming or Kafka
Writes are idempotent - easy exactly once ingestion
Converted to columnar chunks on ingest and stored in
C*
FiloDB keeps your data sorted as it is being ingested
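Why idempotent, primary-key based writes make replay safe can be shown with a small plain-Scala sketch. It assumes a hypothetical row keyed by (bus id, timestamp) and uses an in-memory sorted map as a stand-in for the store; it is not FiloDB's ingestion API.

```scala
import scala.collection.immutable.TreeMap

object IdempotentIngestSketch {
  // Hypothetical primary key: (bus id, timestamp). The payload is a speed reading.
  type Key = (String, Long)

  // Upsert each row by primary key. The TreeMap keeps rows sorted by key as they arrive,
  // and writing the same row twice leaves the store unchanged.
  def ingest(store: TreeMap[Key, Double], batch: Seq[(Key, Double)]): TreeMap[Key, Double] =
    batch.foldLeft(store) { case (acc, (k, v)) => acc.updated(k, v) }
}
```

If a batch is replayed after a failure, ingesting it a second time yields the same store, so at-least-once delivery from the stream gives effectively exactly-once results.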

Spark Streaming -> FiloDB
val ratingsStream = KafkaUtils.createDirectStream[String, String, StringDecoder, Strin...
ratingsStream.foreachRDD {
  (message: RDD[(String, String)], batchTime: Time) => {
    val df = message.map(_._2.split(",")).map(rating => Rating(rating(0).trim.toInt, r...
      toDF("fromuserid", "touserid", "rating")
    // add the batch time to the DataFrame
    val dfWithBatchTime = df.withColumn("batch_time", org.apache.spark.sql.functions.l...
    // save the DataFrame to FiloDB
    dfWithBatchTime.write.format("filodb.spark")
      .option("dataset", "ratings")
      .save()
  }
}
One-line change to write to FiloDB vs Cassandra

Modeling example: NYC Taxi Dataset
The public NYC Taxi Dataset contains telemetry (pickup, dropoff
locations, times) info on millions of taxi rides in NYC.
Medallion prefix   1/1 - 1/6   1/7 - 1/12
AA                 records     records
AB                 records     records
Partition key - :stringPrefix medallion 2 - hash multiple drivers' trips
into ~300 partitions
Segment key - :timeslice pickup_datetime 6d
Row key - hack_license, pickup_datetime
Allows for easy filtering by individual drivers, and slicing by time.
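The computed keys above can be read as simple functions of the row. The sketch below assumes that `:stringPrefix medallion 2` means "the first two characters of the medallion" and that `:timeslice pickup_datetime 6d` means "the start of the 6-day slice containing the pickup time". Both readings are assumptions made for illustration, and the object and function names are hypothetical.

```scala
object TaxiKeysSketch {
  // Six days expressed in milliseconds.
  val SixDaysMs: Long = 6L * 24 * 60 * 60 * 1000

  // Assumed meaning of ":stringPrefix medallion 2": the first two characters.
  def partitionKey(medallion: String): String = medallion.take(2)

  // Assumed meaning of ":timeslice pickup_datetime 6d": round the epoch-millisecond
  // timestamp down to the start of its 6-day slice.
  def segmentKey(pickupEpochMs: Long): Long =
    Math.floorDiv(pickupEpochMs, SixDaysMs) * SixDaysMs
}
```

Under these assumptions, all medallions sharing a two-character prefix land in one partition, and each partition is split into 6-day segments that can be range-scanned by time.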

DEMO TIME
New York City Taxi Data Demo (Spark Notebook)
To follow along:
https://github.com/tuplejump/FiloDB/blob/master/doc/FiloDB_Taxi_G

Fast, Updatable In-Memory
Columnar Storage
Unlike RDDs and DataFrames, FiloDB can ingest new data, and
still be fast
Unlike RDDs, FiloDB can filter in multiple ways, no need for
entire table scan
FAIR scheduler + sub-second latencies => web speed queries

700 Queries Per Second in Apache Spark!
Even for datasets with 15 million rows!
Using FiloDB's InMemoryColumnStore, single host / MBP,
5GB RAM
SQL to DataFrame caching
For more details, see this blog post.

Multiple ways to Accelerate Queries
Columnar projection - read fewer columns, saves I/O
Partition key filtering - read less data
Sort key / PK filtering - read from subset of keys
Possible because FiloDB keeps data sorted
Versioning - write to multiple versions, read from the one you
choose

Cassandra CQL vs Columnar Layout
Cassandra stores CQL tables row-major, each row spans multiple
cells:
PartitionKey   01:first   01:last   01:age   02:first   02:last    02:age
Sales          Bob        Jones     34       Susan      O'Connor   40
Engineering    Dilbert    P         ?        Dogbert    Dog        1

Columnar layouts are column-major:
PartitionKey   first              last              age
Sales          Bob, Susan         Jones, O'Connor   34, 40
Engineering    Dilbert, Dogbert   P, Dog            ?, 1
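The row-major to column-major change can be sketched in plain Scala using the example rows above. The `Person` and `Columns` types are hypothetical, and the unknown age ("?") is modelled as `None`.

```scala
object ColumnarSketch {
  // Row-major: one record per person, all fields together.
  final case class Person(first: String, last: String, age: Option[Int])

  // Column-major form of one partition: one vector per column.
  final case class Columns(first: Vector[String], last: Vector[String], age: Vector[Option[Int]])

  def toColumns(rows: Seq[Person]): Columns =
    Columns(rows.map(_.first).toVector, rows.map(_.last).toVector, rows.map(_.age).toVector)

  // An aggregate over one column only touches that column's vector.
  def averageAge(c: Columns): Option[Double] = {
    val known = c.age.flatten
    if (known.isEmpty) None else Some(known.sum.toDouble / known.size)
  }
}
```

The point of the layout is visible in `averageAge`: it reads only the `age` vector and never touches `first` or `last`, which is what makes columnar projection save I/O on wide tables.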

FiloDB Cassandra Schema
CREATE TABLE filodb.gdelt_chunks (
  partition text,
  version int,
  columnname text,
  segmentid blob,
  chunkid int,
  data blob,
  PRIMARY KEY ((partition, version), columnname, segmentid, chunkid)
) WITH CLUSTERING ORDER BY (columnname ASC, segmentid ASC, chunkid ASC)

FiloDB Architecture
ColumnStore API - currently Cassandra and InMemory, you can
implement other backends - ElasticSearch? etc.
