big DATA
mob SCALE
JAX London 2013 - Darach Ennis - @darachennis
small FAST
DATA guy
Big Data

“The techniques and technologies for such data-intensive science are so different that it is
worth distinguishing data-intensive science from
computational science as a new, fourth paradigm”

- Jim Gray

The Fourth Paradigm: Data-Intensive Scientific Discovery. - Microsoft 2009
DATA intensive
science SCALE
Compute Sympathy
A Wall Street Second
A Swiss Second
Small Data? <= 128 bytes
HTTP GET/POST - A typical RESTful performance
[Chart: Req/Sec, Bw/Sec (MB), Avg Latency (ms), Max Latency (ms) and Stdev (ms) plotted against Concurrent Connections (1, 2, 4, ... 1024) on log-scale axes. Data labels shown on the chart: 12,616; 14,642; 15,499; 15,787; 15,445; 15,330; 15,173; 14,998; 8,705; 3,907; 4,279]
Small Data? <= 1K
HTTP GET/POST - A typical RESTful performance
[Chart: Req/Sec, Bw/Sec (MB), Avg Latency (ms), Max Latency (ms) and Stdev (ms) plotted against Concurrent Connections (1, 2, 4, ... 1024) on log-scale axes. Data labels shown on the chart: 1,288; 1,951; 2,722; 2,849; 2,790; 2,858; 2,916; 2,830; 2,788; 2,842; 690]
Big Events - 1 Billion Sources
Ballpark number of boxes if each box can handle 2500 events/second
[Chart: number of boxes (log-scale value axis) by number of sources (1 million, 10 million, 100 million, 1 billion), with one series per event rate per source (1/dy, 1/hr, 1/mn, 1/sc). Largest labelled value: 400,000 boxes for 1 billion sources at 1/sc]
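The ballpark behind this chart can be sketched in a few lines. This is a minimal sketch, assuming each source emits exactly one event per period and load spreads evenly across boxes; only the 2500 events/second capacity comes from the slide, while the names `boxes_needed`, `BOX_CAPACITY` and `PERIOD_SECONDS` are illustrative.

```python
# Ballpark sizing: boxes needed for N sources each emitting one event per period.
BOX_CAPACITY = 2500  # events/second one box can handle (figure from the slide)

# Seconds per period for the four rates on the chart (labels as on the slide).
PERIOD_SECONDS = {"1/dy": 86_400, "1/hr": 3_600, "1/mn": 60, "1/sc": 1}


def boxes_needed(sources: int, period_seconds: int, capacity: int = BOX_CAPACITY) -> int:
    """Return ceil((sources / period_seconds) / capacity) using integer maths."""
    # sources / period_seconds is the aggregate events/second; dividing by the
    # per-box capacity and rounding up gives the number of boxes.
    return -(-sources // (period_seconds * capacity))


if __name__ == "__main__":
    for sources in (1_000_000, 10_000_000, 100_000_000, 1_000_000_000):
        row = {rate: boxes_needed(sources, secs) for rate, secs in PERIOD_SECONDS.items()}
        print(f"{sources:>13,} sources: {row}")
```

A billion sources at one event per second each is 1,000,000,000 / 2500 = 400,000 boxes; at one event per hour the same population fits on 112.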
Data
Sympathy?
5 V's
5 V’s via [V-PEC-T]
• Business Factors
  • ‘Veracity’ - The What
  • ‘Value’ - The Why
• Technical Domain (Policies, Events, Content)
  • Volume, Velocity, Variety
Source: Ashwani Roy, Charles Cai - QCON London 2013 - http://bit.ly/1f2Pdf9
Incremental

The needs of the individual event or query
outweigh the needs of the aggregate events
or queries in flight in the system
Batch

The needs of the system outweigh the needs
of individual events and queries running in
flight or active within the system
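To make the incremental versus batch contrast concrete, here is a minimal sketch that computes the same answer (a mean) both ways. The names `batch_mean` and `IncrementalMean` are illustrative and not from the talk.

```python
from typing import Iterable


def batch_mean(values: Iterable[float]) -> float:
    """Batch style: collect the whole dataset, then compute the answer once."""
    data = list(values)
    return sum(data) / len(data)


class IncrementalMean:
    """Incremental style: refresh the answer in O(1) as each event arrives."""

    def __init__(self) -> None:
        self.count = 0
        self.mean = 0.0

    def update(self, value: float) -> float:
        # Running-mean update: shift the mean by the new value's share.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


if __name__ == "__main__":
    events = [2.0, 4.0, 6.0, 8.0]
    running = IncrementalMean()
    for value in events:
        print("after", value, "->", running.update(value))
    print("batch:", batch_mean(events))
```

The incremental form has an answer ready after every event; the batch form favours throughput over the whole dataset and only answers once the pass completes.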
“Computing arbitrary functions on an arbitrary
dataset in real time is a daunting problem..”

- Nathan Marz
Lambda Architecture
“Twitter Scale”
5000 msgs/second inbound
<1K “Small data”
“Firehose” outbound - but
that’s just a broadcast
problem (easy)
Lambda: http://bit.ly/Hs53Ur
[Diagram: Lambda architecture overview. Labels: Batch, Serving, Speed, Views, "New Data", Apps; Web, Data, MQ; Time Series, Docs, K/V, Rel]
Lambda: A
All new data is sent to both the batch
layer and the speed layer. In the
batch layer, new data is appended to
the master dataset. In the speed
layer, the new data is consumed to
do incremental updates of the
realtime views.
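A minimal sketch of step A, assuming an in-memory list stands in for the master dataset and a per-key count stands in for a realtime view. `Event`, `LambdaIngest` and `on_new_data` are hypothetical names chosen for illustration.

```python
from collections import Counter
from typing import NamedTuple


class Event(NamedTuple):
    timestamp: int
    key: str


class LambdaIngest:
    """Fan every new event out to both the batch layer and the speed layer."""

    def __init__(self) -> None:
        self.master_dataset = []        # batch layer input: append-only list of Event
        self.realtime_view = Counter()  # speed layer: incrementally updated counts

    def on_new_data(self, event: Event) -> None:
        # Batch layer: the raw event is appended, never updated in place.
        self.master_dataset.append(event)
        # Speed layer: the realtime view is updated incrementally.
        self.realtime_view[event.key] += 1


if __name__ == "__main__":
    ingest = LambdaIngest()
    for ts, key in [(1, "login"), (2, "click"), (3, "click")]:
        ingest.on_new_data(Event(ts, key))
    print(len(ingest.master_dataset), dict(ingest.realtime_view))
```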
Lambda: B
The master dataset is an immutable,
append-only set of data. The master
dataset only contains the rawest
information that is not derived from
any other information you have.
Lambda: Master data set
• From B: “rawest … not derived”
• In many environments it may be preferable to normalise data for later ease of retrieval (e.g. Dremel, strongly typed nested records) to support scalable ad hoc query.
• Derivation allows other forms of efficient retrieval, e.g. using SAX (Symbolic Aggregate Approximation), PAA (Piecewise Aggregate Approximation), etc.
Lambda: http://bit.ly/Hs53Ur
[Diagram: Lambda architecture overview with a "?" annotation. Labels: Batch, Serving, Speed, Views, "New Data", Apps; Web, Data, MQ; Time Series, Docs, K/V, Rel]
SAX & PAA

Piecewise Aggregate
Approximation

Symbolic Aggregate
Approximation

1sc -> 1mn -> 1hr -> 1dy -> 1wk -> 1mh -> 1yr
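A minimal sketch of both techniques, assuming the series length is an exact multiple of the segment count and that z-normalisation uses the population standard deviation. The function names `paa` and `sax` and the default four-letter alphabet are illustrative choices, not from the talk.

```python
from bisect import bisect_right
from statistics import NormalDist, fmean, pstdev
from typing import List, Sequence


def paa(series: Sequence[float], segments: int) -> List[float]:
    """Piecewise Aggregate Approximation: the mean of each equal-width segment."""
    if segments <= 0 or not series or len(series) % segments != 0:
        raise ValueError("sketch assumes a non-empty series whose length is a multiple of segments")
    width = len(series) // segments
    return [fmean(series[i:i + width]) for i in range(0, len(series), width)]


def sax(series: Sequence[float], segments: int, alphabet: str = "abcd") -> str:
    """Symbolic Aggregate Approximation: z-normalise, apply PAA, map to symbols."""
    mu, sigma = fmean(series), pstdev(series)
    normalised = [(x - mu) / sigma if sigma else 0.0 for x in series]
    # Breakpoints cut the standard normal curve into equiprobable regions,
    # one region per symbol in the alphabet.
    n = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(i / n) for i in range(1, n)]
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in paa(normalised, segments))


if __name__ == "__main__":
    per_second = [1, 2, 3, 4, 5, 6, 7, 8]
    print(paa(per_second, 4))  # four coarser samples
    print(sax(per_second, 4))  # one symbol per segment
```

Applying `paa` repeatedly gives the roll-up chain on the slide: 60 per-second samples become one per-minute value, 60 of those one per-hour value, and so on.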
Lambda: C
The batch layer precomputes query
functions from scratch. The results of the
batch layer are called batch views. The
batch layer runs in a while(true) loop and
continuously recomputes the batch views
from scratch. The strength of the batch
layer is its ability to compute arbitrary
functions on arbitrary data. This gives it
the power to support any application.
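A minimal sketch of the batch layer loop, assuming the view is a per-key count and the master dataset fits in memory. `compute_batch_view`, `batch_layer` and the `iterations` bound (added so the loop can terminate in a demo) are hypothetical names.

```python
from collections import Counter
from typing import Callable, Iterable, Optional, Tuple


def compute_batch_view(master_dataset: Iterable[Tuple[int, str]]) -> Counter:
    """Recompute the view from scratch: no state is carried between runs."""
    view = Counter()
    for _timestamp, key in master_dataset:
        view[key] += 1
    return view


def batch_layer(master_dataset: list,
                publish: Callable[[Counter], None],
                iterations: Optional[int] = None) -> None:
    """The slide's while(true) loop; `iterations` bounds it for demos and tests."""
    runs = 0
    while iterations is None or runs < iterations:
        snapshot = list(master_dataset)  # freeze the input for this run
        publish(compute_batch_view(snapshot))
        runs += 1


if __name__ == "__main__":
    master = [(1, "a"), (2, "b"), (3, "a")]
    batch_layer(master, print, iterations=1)
```

Because every run starts from the raw data, a bug in the view logic can be fixed and the next run repairs the view; the price is the latency of a full pass.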
Lambda: D
The serving layer indexes the batch views
produced by the batch layer and makes it
possible to get particular values out of a
batch view very quickly. The serving layer
is a scalable database that swaps in new
batch views as they’re made available.
Because of the latency of the batch layer,
the results available from the serving layer
are always out of date by a few hours.
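A minimal sketch of the swap, assuming a batch view is a plain key/value mapping held in memory. `ServingLayer`, `swap_in` and `get` are illustrative names rather than the API of any particular database.

```python
from types import MappingProxyType
from typing import Any, Mapping


class ServingLayer:
    """Serve reads from the current batch view; swap in new views wholesale."""

    def __init__(self) -> None:
        self._view: Mapping[str, Any] = MappingProxyType({})

    def swap_in(self, batch_view: Mapping[str, Any]) -> None:
        # Copy the new view into a read-only mapping, then replace the
        # reference in a single assignment. The old view is never mutated.
        self._view = MappingProxyType(dict(batch_view))

    def get(self, key: str, default: Any = None) -> Any:
        # Random reads by key against whichever view is current.
        return self._view.get(key, default)


if __name__ == "__main__":
    serving = ServingLayer()
    serving.swap_in({"a": 2, "b": 1})
    print(serving.get("a"), serving.get("missing", 0))
```

Reads only ever see a complete view, old or new, which is why the serving layer can stay simple: it needs bulk loads and random reads, not random writes.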
Lambda: http://bit.ly/Hs53Ur
[Diagram: Lambda architecture overview with a "?" annotation. Labels: Batch, Serving, Speed, Views, "New Data", Apps; Web, Data, MQ; Time Series, Docs, K/V, Rel]
Think ‘Statistical
Compression’
Lambda: E
The speed layer compensates for the high latency of updates
to the serving layer. It uses fast incremental algorithms and
read/write databases to produce realtime views that are
always up to date. The speed layer only deals with recent
data, because any data older than that has been absorbed
into the batch layer and accounted for in the serving layer.
The speed layer is significantly more complex than the
batch and serving layers, but that complexity is
compensated by the fact that the realtime views can be
continuously discarded as data makes its way through
the batch and serving layers. So, the potential negative
impact of that complexity is greatly limited.
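A minimal sketch of a speed layer that keeps incremental counts in time buckets and discards buckets once the batch layer has caught up. It assumes integer timestamps in seconds and that the batch layer absorbs whole buckets; `SpeedLayer`, `discard_absorbed` and `batch_horizon` are hypothetical names.

```python
from collections import Counter, defaultdict


class SpeedLayer:
    """Incremental realtime view over recent data only."""

    def __init__(self, bucket_seconds: int = 60) -> None:
        self.bucket_seconds = bucket_seconds
        self._buckets = defaultdict(Counter)  # bucket start time -> counts

    def update(self, timestamp: int, key: str) -> None:
        # Incremental update: touch only the bucket this event falls into.
        bucket = timestamp - timestamp % self.bucket_seconds
        self._buckets[bucket][key] += 1

    def discard_absorbed(self, batch_horizon: int) -> None:
        """Drop buckets that end at or before the time the batch views cover."""
        absorbed = [b for b in self._buckets if b + self.bucket_seconds <= batch_horizon]
        for bucket in absorbed:
            del self._buckets[bucket]

    def realtime_view(self) -> Counter:
        total = Counter()
        for counts in self._buckets.values():
            total.update(counts)
        return total


if __name__ == "__main__":
    speed = SpeedLayer(bucket_seconds=60)
    for ts, key in [(10, "a"), (70, "a"), (75, "b")]:
        speed.update(ts, key)
    print(speed.realtime_view())
    speed.discard_absorbed(batch_horizon=60)
    print(speed.realtime_view())
```

Discarding is what bounds the damage from speed-layer complexity: any error in a realtime view lives only until the batch layer overtakes it.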
Lambda: http://bit.ly/Hs53Ur
[Diagram: Lambda architecture overview with a "?" annotation next to Speed. Labels: Batch, Serving, Speed, Views, "New Data", Apps; Web, Data, MQ; Time Series, Docs, K/V, Rel]
Use a DSP + CEP/ESP or
‘Scalable CEP’
• Storm/S4 + Esper/…
• Embed a CEP/ESP within a Distributed Stream Processing engine
• Use Drill for large scale ad hoc query [leverage nested records]
• Already have middleware? Have well defined queries? Roll your own minimal EEP (or use mine!)
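As an illustration of how small a roll-your-own embedded event processor can be for a well-defined query, here is a count-based tumbling window. This is a sketch under assumptions, not the author's EEP library; `TumblingWindow`, `on_event`, `aggregate` and `emit` are hypothetical names.

```python
from typing import Any, Callable, List


class TumblingWindow:
    """Count-based tumbling window: emit one aggregate per `size` events."""

    def __init__(self, size: int,
                 aggregate: Callable[[List[Any]], Any],
                 emit: Callable[[Any], None]) -> None:
        if size <= 0:
            raise ValueError("window size must be positive")
        self.size = size
        self.aggregate = aggregate
        self.emit = emit
        self._buffer: List[Any] = []

    def on_event(self, value: Any) -> None:
        self._buffer.append(value)
        if len(self._buffer) == self.size:
            # Window is full: emit its aggregate and start a fresh window.
            self.emit(self.aggregate(self._buffer))
            self._buffer = []


if __name__ == "__main__":
    window = TumblingWindow(size=3, aggregate=sum, emit=print)
    for value in range(1, 8):
        window.on_event(value)  # emits after the 3rd and 6th events
```

The query (here `sum` over three events) is fixed up front and pushed to the data, so the processor can sit inside existing middleware as a callback.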
Lambda: F
Queries are resolved by getting results from both
the batch and realtime views and merging them
together.
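A minimal sketch of query resolution, assuming both views hold additive per-key counts and the realtime view covers only events not yet reflected in the batch view. `resolve_query` and `merged_view` are illustrative names.

```python
from collections import Counter
from typing import Mapping


def resolve_query(key: str,
                  batch_view: Mapping[str, int],
                  realtime_view: Mapping[str, int]) -> int:
    """Answer a single-key query by combining both views."""
    return batch_view.get(key, 0) + realtime_view.get(key, 0)


def merged_view(batch_view: Mapping[str, int],
                realtime_view: Mapping[str, int]) -> Counter:
    """Merge whole views; valid here because counts are additive."""
    return Counter(batch_view) + Counter(realtime_view)


if __name__ == "__main__":
    batch = {"a": 10, "b": 3}  # hours old, covers everything up to the last batch run
    realtime = {"a": 2, "c": 1}  # only events since the last batch run
    print(resolve_query("a", batch, realtime))
    print(merged_view(batch, realtime))
```

Addition works for counts; other functions (distinct counts, medians) need a merge operation of their own, which is where much of the difficulty lies.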
MillWheel: http://bit.ly/1gWqNIC
[Diagram: Google’s “Zeitgeist pipeline”. Labels: Web Query, Queries, Window Counter, Stats, Model, Out of Trend?, Alerts, Monitor]
Lambda: Batch View
• Precomputed queries are central to Complex Event Processing / Event Stream Processing architectures.
• Unfortunately, though, most DBMSs still offer only synchronous, blocking RPC access to underlying data, when asynchronous guaranteed delivery would be preferable for view construction leveraging CEP/ESP techniques.
Lambda: Merging …
• Possibly one of the most difficult aspects of near real-time and historical data integration is combining flows sensibly.
• For example, is the order of interleaving across merge sources applied in a known deterministically recomputable order? If not, how can results be recomputed subsequently? Will data converge?

[cf: http://cs.brown.edu/research/aurora/hwang.icde05.ha.pdf]
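One way to get a known, recomputable interleaving is to merge on an explicit total order rather than on arrival order. The sketch below assumes every event carries a timestamp, a source id and a per-source sequence number, and that each input stream is already sorted by that key. The tuple layout and the name `deterministic_merge` are assumptions for illustration, not taken from the cited paper.

```python
import heapq
from typing import Iterable, Iterator, Tuple

# (timestamp, source_id, sequence, payload): a hypothetical event layout.
Event = Tuple[int, str, int, str]


def deterministic_merge(*sources: Iterable[Event]) -> Iterator[Event]:
    """Merge sorted streams into one stream with a reproducible total order."""
    # Ties on timestamp are broken by source id, then by sequence number, so
    # the result does not depend on which stream happened to arrive first.
    return heapq.merge(*sources, key=lambda e: (e[0], e[1], e[2]))


if __name__ == "__main__":
    historical = [(1, "batch", 0, "a"), (3, "batch", 1, "c")]
    live = [(1, "speed", 0, "x"), (2, "speed", 1, "b")]
    for event in deterministic_merge(historical, live):
        print(event)
```

Replaying the same inputs in either argument order yields the same output sequence, so a downstream view can be recomputed later and converge to the same result.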
Lambda: A start …
[Diagram: Lambda architecture overview. Labels: Batch, Serving, Speed, Views, "New Data", Apps; Web, Data, MQ; Time Series, Docs, K/V, Rel]
mob DATA
Not a Jedi
… yet …
Thanks.
Questions?
@darachennis

Big Events, Mob Scale - Darach Ennis (Push Technology)