Typesafe & William Hill: Cassandra, Spark, and Kafka - The New Streaming Data Troika

Dean Wampler (Typesafe), Patrick Di Loreto (William Hill)
Cassandra, Spark, and Kafka: The Streaming Data Troika

About Typesafe
Typesafe Reactive Platform
• Akka, Play, and Spark, for Scala and Java.
• typesafe.com/reactive-big-data

What’s Reactive?
• Responsive
• Elastic
• Resilient
• Message Driven

About William Hill
Online Sportsbook and Gaming provider
• Every day we push more than 5 million price changes
• 160TB of data flowing through our platform each day

We're hiring: https://careers.williamhill.com
• WH Apple Watch App
• Interactive Scoreboard
• Virtual Reality Horse Race (Oculus Rift)

Big Data Circa 2010
Generally two camps. One was the offline, batch-mode processing of massive data sets done with Hadoop.

Big Data Circa 2010
Akka
The other was the online, real-time processing and storage of “transactional” data at scale, as exemplified by Cassandra for the data store and by middleware tools and libraries like Akka, Spring, etc.

Big Data Circa 2010
Akka
?
Two camps together with some overlap and connectivity, but not a lot.

Big Data Circa 2015
We still have this:
Akka
?
Five years later (this year), we still have these architectures in wide use, but…

Big Data Circa 2015
But now we have this:
Big Data
Streaming
Mesos, EC2, or Bare Metal
A new, streaming-oriented architecture is emerging, which can also be used for batch-mode analysis if we process resident data sets as finite streams.

General Principles
• Spark Streaming: Analytics/aggregations
• C*: Storage, queries
• Kafka: durable message store; allows replay of messages lost downstream.
Spark Streaming provides rich analytics.

You need a durable system of record, like Kafka, which allows repeat reads in case of loss. See https://medium.com/@foundev/real-time-analytics-with-spark-streaming-and-cassandra-2f90d03342f7 for a nice summary of design patterns and tips.

Mesos, EC2, or Bare Metal
Let’s explore this.

Cassandra remains the flexible, scalable datastore, well suited to ingesting streaming data at scale, such as event streams (e.g., click streams from web apps) and logs.

Kafka is growing in popularity as a tool for durable ingestion of diverse event streams, with partitioning for scale and organization into topics (like a typical message queue) for downstream consumers.

Diagram: producers and consumers (Service 1, Service 2, Service 3, log and other files, Internet services) wired point to point, giving N * M links.
One use of Kafka is to solve the problem of N*M direct links between producers and consumers. Such links are hard to manage, and they couple services directly, which is fragile when a given service needs to be scaled up through replication or replaced. Sometimes the coupling extends to the protocol that both ends need to speak.

Diagram: the same producers and consumers connected through a central hub, giving N + M links.
So Kafka can function as a central hub, yet it’s distributed and scalable so it isn’t a bottleneck or single point of failure.
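
To make the N * M versus N + M comparison concrete, here is a minimal arithmetic sketch. The function names are made up for illustration, and it assumes every producer needs to reach every consumer, which is the worst case for direct wiring.

```python
def direct_links(producers: int, consumers: int) -> int:
    """Point-to-point wiring: every producer connects to every consumer."""
    return producers * consumers


def hub_links(producers: int, consumers: int) -> int:
    """Hub wiring: each service holds one connection to the central hub."""
    return producers + consumers


if __name__ == "__main__":
    for n, m in [(3, 3), (10, 20), (50, 100)]:
        print(f"{n} producers, {m} consumers: "
              f"{direct_links(n, m)} direct links vs {hub_links(n, m)} via a hub")
```

The gap widens quickly: adding one more consumer costs N new links with direct wiring, but only one with a hub.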

Kafka Usage
Diagram: Topic A as an ordered sequence of messages (n, n+1, …, n+5). Producer 1 and Producer 2 append to it; Consumer 1 and Consumer 2 each read at their own position (n+?).
The message queue structure looks basically like this: different producers append messages to a topic, and different consumers read the messages in the queue at their own pace, in order.

Kafka Resiliency
• Data loss downstream? Can replay lost messages.
You could use C* for this, but then you’ve changed the read/write load (and hence the tuning, scaling, etc. of your C* ring).
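
The topic structure and the replay idea can be sketched with a toy, in-memory log. This is only an illustration under simplifying assumptions (one topic, no partitions, no persistence). The `Topic` class and its `append`, `poll`, and `seek` methods are hypothetical names, not the Kafka client API.

```python
from typing import Any, Dict, List


class Topic:
    """Toy in-memory stand-in for a single topic; NOT the Kafka client API."""

    def __init__(self) -> None:
        self._log: List[Any] = []           # append-only, ordered messages
        self._offsets: Dict[str, int] = {}  # next position to read, per consumer

    def append(self, message: Any) -> int:
        """Producer side: add a message at the end and return its offset."""
        self._log.append(message)
        return len(self._log) - 1

    def poll(self, consumer: str, max_messages: int = 10) -> List[Any]:
        """Consumer side: read the next messages in order, at this consumer's pace."""
        start = self._offsets.get(consumer, 0)
        batch = self._log[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch

    def seek(self, consumer: str, offset: int) -> None:
        """Move a consumer's position so earlier messages can be replayed."""
        if not 0 <= offset <= len(self._log):
            raise ValueError("offset out of range")
        self._offsets[consumer] = offset


topic = Topic()
for msg in ["bet-1", "deposit-1", "bet-2"]:      # writes from different producers
    topic.append(msg)

fast = topic.poll("consumer-1")                  # reads everything available
slow = topic.poll("consumer-2", max_messages=1)  # reads only the first message

topic.seek("consumer-1", 1)                      # pretend data after offset 0 was lost downstream
replayed = topic.poll("consumer-1")              # re-reads the retained messages
print(fast, slow, replayed)
```

Because the log keeps messages after they are read, a consumer that loses data downstream only needs to rewind its own position; other consumers are unaffected.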

The third element of the “troika” is Spark, the next-generation, scalable compute engine that is replacing MapReduce in Hadoop. However, Spark is flexible enough to run in many cluster configurations, including a local mode for development, a standalone cluster mode for simple scenarios, Mesos for general scalability and flexibility, and integration with Cassandra itself.

Spark Streaming Dos/Don’ts
Do
• Use for rich analytics and aggregations.
• Use with a Kafka/C* source if data loss is not tolerable. Or use the write-ahead log (WAL), which is less optimal.
Spark Streaming offers rich analytics, even SQL, machine learning, and graph representations. It’s a more complex engine, so there is more “room” for data loss. Hence, use Kafka or C* for durability and replay capabilities. If you do ingest data directly from other sources without replay capability, at least use the WAL.

Spark Streaming Don’ts
Don’t
• Use for counting (use C*).
• Use for low-latency, per-event processing.
C* is faster and more accurate for counting, because repeat execution of Spark tasks (for error recovery, speculative execution, etc.) will cause over-counting (e.g., when using the “accumulator” feature). Also, Spark Streaming is a mini-batch system, for processing time slices of events (down to ~1 sec.). If you need low-latency and/or per-event processing, use Akka…
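
The over-counting problem can be reproduced without Spark. The sketch below simulates a task that fails partway through and is re-executed from the start. Everything here is hypothetical and for illustration only: `NaiveCounter` stands in for a side-effecting counter updated inside a task, and `DedupCounter` shows one way to make counting idempotent by keying on an event id. It is not a description of how C* or Spark work internally.

```python
from typing import List, Optional


class NaiveCounter:
    """Increments on every call, so re-executed work is counted again."""

    def __init__(self) -> None:
        self.count = 0

    def record(self, event_id: str) -> None:
        self.count += 1


class DedupCounter:
    """Counts distinct event ids, so recording the same event twice is harmless."""

    def __init__(self) -> None:
        self._seen = set()

    def record(self, event_id: str) -> None:
        self._seen.add(event_id)

    @property
    def count(self) -> int:
        return len(self._seen)


def run_task(events: List[str], sink, fail_after: Optional[int] = None) -> None:
    """Process events in order; optionally fail after `fail_after` events."""
    for index, event_id in enumerate(events):
        if fail_after is not None and index == fail_after:
            raise RuntimeError("simulated task failure")
        sink.record(event_id)


def run_with_retry(events: List[str], sink, fail_after: int) -> None:
    """First attempt fails partway; the retry re-executes the whole task."""
    try:
        run_task(events, sink, fail_after)
    except RuntimeError:
        run_task(events, sink)


events = ["e1", "e2", "e3", "e4"]
naive, dedup = NaiveCounter(), DedupCounter()
run_with_retry(events, naive, fail_after=2)
run_with_retry(events, dedup, fail_after=2)
print(naive.count, dedup.count)  # the naive count exceeds the 4 real events
```

The naive counter sees the first two events twice, which is exactly the kind of inflation that task re-execution introduces.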

Other parts of complete infrastructure include a distributed file system like CSFv2, when you don’t need a full database, e.g., for logs that you’ll dump into the file system and then process in batches later on with Spark.

Typesafe Reactive Platform provides infrastructure tools for integrating these and other components, including Akka Streams for resilient, low-latency event processing
(based on the Reactive Streams standard for streams with dynamic back pressure), ConductR for orchestrating services, and Play for web services and consoles.

Typesafe Reactive Platform
• Akka Streams: low-latency, per-event processing.
• ConductR for orchestrating services.
• Play for web services, consoles.
• … and commercial Spark support.
Akka Streams implements the Reactive Streams standard for streams with dynamic back pressure. It sits on top of the more general Akka Actor framework for highly
distributed concurrent applications.

Typesafe offers commercial support for teams developing advanced Spark applications. We offer production runtime support for Spark running on Mesos clusters.

Finally, a wealth of cluster options is possible. You could deploy these tools on the servers for your Cassandra ring, which has excellent integration with Spark. You can run in EC2 or on bare metal. You can use a general-purpose cluster management system like Mesos.

Presented by Patrick Di Loreto
R&D Engineering Lead
Site: https://developer.williamhill.com
Twitter: https://twitter.com/patricknoir
OMNIA
Distributed & Reactive platform for data management

Motivations
In order to be in a position to innovate we need to control and understand our data.
Diagram labels: Users, Feeds, System, 3rd Party, Social Networks, IoT, William Hill.
Need for control over the data
What is Omnia?
A DMP (data management platform) based on the Lambda architecture and the Reactive principles.
• Chronos: Data Source
• NeoCortex: Speed Layer
• Fates: Batch Layer
• Hermes: Serving Layer
Data flow: Input to Output
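
As a reminder of what the Lambda architecture implies for queries, here is a minimal sketch of the general pattern: a serving layer answers by combining a precomputed batch view with a speed-layer view of recent data. This is the generic idea, not Omnia's implementation; the view contents and the `serve_count` function are made up for illustration.

```python
from typing import Dict


def serve_count(batch_view: Dict[str, int], speed_view: Dict[str, int], key: str) -> int:
    """Combine the precomputed batch result with the recent, not-yet-batched delta."""
    return batch_view.get(key, 0) + speed_view.get(key, 0)


# Hypothetical views: bets per customer.
batch_view = {"customer-123": 40}                     # recomputed periodically over all history
speed_view = {"customer-123": 2, "customer-456": 1}   # events since the last batch run

print(serve_count(batch_view, speed_view, "customer-123"))
print(serve_count(batch_view, speed_view, "customer-456"))
```

When the next batch run absorbs the recent events, the speed view for that period is discarded, which keeps the fast path small.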

Reactive principles
• Responsive
• Resilient
• Message Driven
• Elastic
The Reactive Manifesto: http://www.reactivemanifesto.org/

Chronos: Data acquisition
Chronos is a reliable and scalable component which collects data from different sources and organizes it into streams of observable events.
Incident: {
  type: "bet",
  version: "1.0",
  time: "2015-09-03 06:00:10",
  acquisitionTime: "2015-09-03 06:00:06",
  source: "BetSystem",
  payload: { … any valid JSON }
}
Diagram: Chronos (Data Source)
• Adapter protocols: TCP, HTTP, WS, JMS, HTTP Poll, SSE, …
• Streams: Bets, Deposits, Prices
Stream = Adapter + Converter + Persistence
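
One way to read "Stream = Adapter + Converter + Persistence" is as a composition of three functions. The sketch below is an assumption about the shape of that composition, not Chronos code. The `Incident` fields follow the slide, while `run_stream`, `bet_adapter`, `bet_converter`, the raw message layout, and the in-memory `store` are hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Callable, Dict, Iterable, List


@dataclass
class Incident:
    """Envelope with the fields shown on the slide."""
    type: str
    version: str
    time: str
    acquisitionTime: str
    source: str
    payload: Dict[str, Any]


def run_stream(adapter: Callable[[], Iterable[str]],
               converter: Callable[[str], Incident],
               persist: Callable[[Incident], None]) -> int:
    """Pull raw messages from the adapter, convert each one, and persist it."""
    count = 0
    for raw in adapter():
        persist(converter(raw))
        count += 1
    return count


def bet_adapter() -> Iterable[str]:
    """Hypothetical adapter; a real one would read from SSE, JMS, HTTP, etc."""
    yield '{"betId": 1, "stake": 5.0, "placedAt": "2015-09-03 06:00:10"}'


def bet_converter(raw: str) -> Incident:
    """Wrap the source-specific JSON in the common Incident envelope."""
    body = json.loads(raw)
    return Incident(type="bet", version="1.0", time=body["placedAt"],
                    acquisitionTime="2015-09-03 06:00:12",  # fixed value for repeatability
                    source="BetSystem", payload=body)


store: List[str] = []  # stand-in for a durable log
processed = run_stream(bet_adapter, bet_converter,
                       lambda incident: store.append(json.dumps(asdict(incident))))
print(processed, store[0])
```

Keeping the three parts separate means a new source needs only a new adapter and converter; the persistence step can stay the same.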

Chronos: Data acquisition
• Chronos 1 (SSE, Bets placed)
• Chronos 2 (JMS, Deposits)
• Chronos 3 (HTTP, Events)
• …
• Chronos N (SSE, Twitter)
• Chronos 2 (JMS, Deposits) (SSE, Bet Placed)

Chronos: Why Kafka
High-throughput distributed messaging system
• High Availability
• Efficiency
• Durable

Kafka is a high-throughput distributed messaging system.
Design principles:
• Highly available: replicated and distributed.
• High throughput: stateless broker.
• Disk efficiency: “Don’t fear the file system”. Modern OSs optimize sequential disk operations and disk caching strategy. Kafka uses the OS filesystem cache rather than an application-level cache, which is more efficient (no GC usage) and survives an application restart.
• I/O efficiency: batching reduces small I/O operations, amortizes network round-trip overhead, and enables larger sequential disk operations.
• Durable.
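
The effect of batching on the number of I/O operations is simple arithmetic. The sketch below assumes each batch costs one write or one network round trip, which is a simplification; the function names and the numbers are made up for illustration.

```python
import math
from typing import List, Sequence


def io_operations(messages: int, batch_size: int) -> int:
    """Number of writes or round trips if each batch costs one operation."""
    return math.ceil(messages / batch_size)


def batches(items: Sequence, batch_size: int) -> List[list]:
    """Split items into consecutive batches of at most `batch_size`."""
    return [list(items[i:i + batch_size]) for i in range(0, len(items), batch_size)]


print(io_operations(1000, 1))    # one operation per message
print(io_operations(1000, 100))  # one operation per batch of 100
print(batches(range(5), 2))
```

The fixed per-operation overhead is paid once per batch instead of once per message, at the price of a little added latency while a batch fills.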

Fates: Batch layer
Fates represents the long-term memory of Omnia. It organizes the incidents that Chronos collected into timelines and also derives new information as views by using machine learning, logical reasoning, and time series analysis.
Timelines & Views
• Customer: 123 (Login, Deposit, Bet placed, …, Logout)
• Event: 78 (Started, Fault, Penalty, …, Goal)
Diagram labels: Bets, Deposits, Events, Session, Fates (Batch Layer).
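
Organizing incidents into timelines amounts to grouping by an entity key and ordering by time. The sketch below is an assumed, simplified version of that step, not Fates code. `build_timelines`, the `customer` payload field, and the sample incidents are hypothetical. It relies on timestamps in the slide's "YYYY-MM-DD HH:MM:SS" format, which sort chronologically as plain strings.

```python
from collections import defaultdict
from typing import Any, Dict, List

Incident = Dict[str, Any]


def build_timelines(incidents: List[Incident], key: str) -> Dict[Any, List[Incident]]:
    """Group incidents by a payload field and order each group by time."""
    timelines: Dict[Any, List[Incident]] = defaultdict(list)
    for incident in incidents:
        timelines[incident["payload"][key]].append(incident)
    for entries in timelines.values():
        entries.sort(key=lambda incident: incident["time"])
    return dict(timelines)


incidents = [
    {"type": "deposit", "time": "2015-09-03 06:01:00", "payload": {"customer": 123}},
    {"type": "login", "time": "2015-09-03 06:00:10", "payload": {"customer": 123}},
    {"type": "bet", "time": "2015-09-03 06:02:30", "payload": {"customer": 123}},
    {"type": "login", "time": "2015-09-03 06:00:45", "payload": {"customer": 456}},
]

timelines = build_timelines(incidents, "customer")
print([incident["type"] for incident in timelines[123]])
```

Views can then be computed per timeline, for example the time between a login and the first bet.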

Fates: Batch layer
Diagram labels: Timelines, Views, Jobs.

Fates: Cassandra
Cassandra is the long-term storage for our data.
• Highly Available (CAP)
• Linear Scalability
• Multi DC: Separation of Concerns (Production and Analytic DCs)
• High performance, optimized for write operations

NeoCortex: Speed layer
NeoCortex represents the short-term memory of Omnia. It offers a framework for developing microservices on top of Apache Spark. It performs fast, real-time processing of the data acquired from Chronos and Fates.
Diagram labels: Bets, Deposits, Events, Session, NeoCortex (Micro Services), Output.

Hermes: Serving Layer
Hermes is a scalable, full-duplex communication layer for B2C and B2B.
Diagram labels: B2C, B2B, Browser, WH Apps, Apps, JS API, TCP, WS, HTTP, Load balancer, Push Servers, Cache, Distributed Cache.
Omnia Data Flow
Custom advert, bonus, data load prediction, bot detection...
Diagram labels: Input, Chronos (Data Source), NeoCortex (Speed Layer), Fates (Batch Layer), Hermes (Serving Layer), Output.
Users become a new data producer

Omnia on Omnia
Real-time monitoring and elasticity
Docker and Mesos: scale in and out based on demand
Diagram labels: Input, Chronos (Data Source), NeoCortex (Speed Layer), Fates (Batch Layer), Hermes (Serving Layer), Output, JMX.

Omnia infrastructure
Diagram layers: Omnia, Docker, Marathon, Mesos, Nodes.

Thank you
careers.williamhillplc.com
omnia.williamhill.com/
typesafe.com/reactive-big-data
