codecentric AG: Using Cassandra and Clojure for Data Crunching backends

Do you see the elephant
being swallowed by the snake?

Sequence ID

+---------------+---------------+
|   timestamp   |  sequenceId   |
+---------------+---------------+

Used to avoid timestamp resolution collisions
To ensure sub-resolution order
Snapshot the data on overflow or timeout
Ensures idempotence
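The ordering role of the sequence ID can be sketched as a composite key. This is a minimal illustration, not the talk's implementation: `EventKey`, `assignKeys`, the field widths and the millisecond unit are all assumptions. The derived `Ord` instance compares the timestamp first and the sequence ID second, so events that share a timestamp still have a stable order.

```haskell
import Data.Word (Word32, Word64)

-- Hypothetical event key: a coarse timestamp plus a per-timestamp sequence number.
data EventKey = EventKey
  { timestamp  :: !Word64  -- assumed unit: milliseconds since the epoch
  , sequenceId :: !Word32  -- disambiguates events that share one timestamp
  } deriving (Eq, Ord, Show)

-- Assign keys to timestamps that arrive in order and may repeat.
-- The sequence id restarts at 0 whenever the timestamp changes.
assignKeys :: [Word64] -> [EventKey]
assignKeys = go Nothing
  where
    go _ [] = []
    go prev (ts : rest) =
      let key = case prev of
            Just (EventKey pts sq) | pts == ts -> EventKey ts (sq + 1)
            _                                  -> EventKey ts 0
      in key : go (Just key) rest
```

Because the key is fully determined by the timestamp and the position within it, replaying the same input yields the same keys, which is one plausible reading of "ensures idempotence".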

Range Tables
ts1 ts2 ts3 ts4 ts5 ts6 ts7 ts8 ts9 ts10 ts11 ts12 ts13

Full Table Scan
Start End

Open Range
Start End

“Between” Range
Start End
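The three range shapes can be modelled as one small data type. The sketch below is illustrative only: the `Range` type, its constructor names, and the inclusive-start / exclusive-end convention are assumptions, and an in-memory list stands in for a table ordered by timestamp.

```haskell
-- Hypothetical range description; bounds are inclusive start, exclusive end.
data Range k
  = FullScan     -- no bounds: every row
  | OpenFrom k   -- open range with a start only
  | OpenUntil k  -- open range with an end only
  | Between k k  -- both start and end
  deriving (Eq, Show)

-- Does a key fall inside the range?
inRange :: Ord k => Range k -> k -> Bool
inRange FullScan      _ = True
inRange (OpenFrom s)  k = k >= s
inRange (OpenUntil e) k = k < e
inRange (Between s e) k = k >= s && k < e

-- Select rows from an in-memory stand-in for a table keyed by k.
selectRange :: Ord k => Range k -> [(k, v)] -> [(k, v)]
selectRange r = filter (inRange r . fst)
```

For example, `selectRange (Between 3 6)` over rows keyed `1..13` keeps the rows keyed 3, 4 and 5.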

Step 2: add some algebra
(rich query API)

Stream Fusion
for rich ad-hoc queries

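-- Requires the ExistentialQuantification extension.
-- One step of a stream: an element plus the next cursor, a skip, or the end.
data Step a cursor = Yield a !cursor
                   | Skip !cursor
                   | Done
-- A stream is a step function plus an initial cursor; the cursor type is hidden.
data Stream a =
  forall cursor. Stream (cursor -> Step a cursor) cursor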

Stream Beginning:
reading from the DB

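map
maps :: (a -> b) -> Stream a -> Stream b
maps f (Stream next0 s0) = Stream next s0
  where
    next cursor = case next0 cursor of
      Yield x cursor' -> Yield (f x) cursor'  -- apply f to the element, not the cursor
      Skip cursor'    -> Skip cursor'
      Done            -> Done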

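filter
filters :: (a -> Bool) -> Stream a -> Stream a
filters p (Stream next0 s0) = Stream next s0
  where
    next cursor = case next0 cursor of
      Yield x cursor' | p x       -> Yield x cursor'
                      | otherwise -> Skip cursor'  -- drop the element, advance the cursor
      Skip cursor'                -> Skip cursor'
      Done                        -> Done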

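reduce/fold
foldls :: (Monoid acc) => (acc -> a -> acc) -> acc -> Stream a -> acc
foldls f z (Stream next s0) = loop z s0
  where
    loop acc cursor = case next cursor of
      Yield x cursor' -> loop (f acc x) cursor'
      Skip cursor'    -> loop acc cursor'
      Done            -> acc  -- return the accumulator, not the initial value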
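Putting the pieces together, a whole pipeline might look like the following self-contained sketch. It restates the stream types with shorter variable names so that it compiles on its own. `fromList` and `sumOfEvenSquares` are hypothetical names, a plain list stands in for rows read from the database, and the `Monoid` constraint is dropped from the fold because this example does not need it.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

-- One step of a stream: an element and the next cursor, a skip, or the end.
data Step a s = Yield a !s | Skip !s | Done

-- A step function plus an initial cursor; the cursor type is hidden.
data Stream a = forall s. Stream (s -> Step a s) s

-- Stream beginning: a list stands in for a database cursor (assumption).
fromList :: [a] -> Stream a
fromList = Stream next
  where
    next []       = Done
    next (x : xs) = Yield x xs

maps :: (a -> b) -> Stream a -> Stream b
maps f (Stream next0 s0) = Stream next s0
  where
    next s = case next0 s of
      Yield x s' -> Yield (f x) s'
      Skip s'    -> Skip s'
      Done       -> Done

filters :: (a -> Bool) -> Stream a -> Stream a
filters p (Stream next0 s0) = Stream next s0
  where
    next s = case next0 s of
      Yield x s'
        | p x       -> Yield x s'
        | otherwise -> Skip s'
      Skip s'       -> Skip s'
      Done          -> Done

foldls :: (acc -> a -> acc) -> acc -> Stream a -> acc
foldls f z (Stream next s0) = loop z s0
  where
    loop acc s = case next s of
      Yield x s' -> loop (f acc x) s'
      Skip s'    -> loop acc s'
      Done       -> acc

-- Filter, map and fold run as one loop over the cursor; no intermediate list is built.
sumOfEvenSquares :: [Int] -> Int
sumOfEvenSquares = foldls (+) 0 . maps (\x -> x * x) . filters even . fromList
```

Each combinator only wraps the step function, so the query is evaluated in a single pass once the fold drives it.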

Append
class Monoid a where
  mempty :: a
  -- ^ Identity of 'mappend'
  mappend :: a -> a -> a
  -- ^ An associative operation

Combine
class (Monoid intermediate) =>
    Aggregate intermediate end
  where
    combine :: intermediate -> end

Count Example
data Count = Count Int
instance Monoid Count where
  mempty = Count 0
  mappend (Count a) (Count b) = Count $ a + b
instance Aggregate Count Int where
  combine (Count a) = a
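Why the intermediate value should be a monoid becomes clearer when partial results are computed separately and merged. The sketch below is an assumption-laden illustration: it uses the `Semigroup` and `Monoid` classes from `base` rather than the class defined on the slide, and `chunksOf`, `countChunk` and `countAll` are hypothetical helpers where each chunk plays the role of one partition or worker.

```haskell
newtype Count = Count Int deriving (Eq, Show)

-- Associative append: partial counts can be merged in any grouping.
instance Semigroup Count where
  Count a <> Count b = Count (a + b)

-- Identity element: the count of an empty partition.
instance Monoid Count where
  mempty = Count 0

-- Split the input into chunks of at most n elements (n is clamped to at least 1).
chunksOf :: Int -> [a] -> [[a]]
chunksOf n xs
  | null xs   = []
  | otherwise = chunk : chunksOf n rest
  where
    (chunk, rest) = splitAt (max 1 n) xs

-- Intermediate aggregate for one chunk.
countChunk :: [a] -> Count
countChunk = foldMap (const (Count 1))

-- Merge the partial aggregates, then extract the end result ("combine" on the slide).
countAll :: Int -> [a] -> Int
countAll n = combine . mconcat . map countChunk . chunksOf n
  where
    combine (Count c) = c
```

Because the append is associative and has an identity, the merged result does not depend on how the data was chunked.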

CREATE TABLE support_vectors(
path varchar,
alpha list<double>,
phi int,
PRIMARY KEY(path))

Problems
High deserialisation overhead
Need to add PK specifiers for multiple SVs

byte address
0     8     16    24    32    40         n*8
+-----+-----+-----+-----+-----+----------+-----+
| α0  | α1  | α2  | α3  | α4  |   ...    | αn  |
+-----+-----+-----+-----+-----+----------+-----+
points

byte address (row offset + column offset)
            0     8     16    24    32    40         n*8
            +-----+-----+-----+-----+-----+----------+-----+
0 +         | α00 | α01 | α02 | α03 | α04 |   ...    | α0n |
            +-----+-----+-----+-----+-----+----------+-----+
n*8 +       | α10 | α11 | α12 | α13 | α14 |   ...    | α1n |
            +-----+-----+-----+-----+-----+----------+-----+
            ...
            +-----+-----+-----+-----+-----+----------+-----+
m*n*8 +     | αm0 | αm1 | αm2 | αm3 | αm4 |   ...    | αmn |
            +-----+-----+-----+-----+-----+----------+-----+

Advantages
“As compact as it gets” representation
Smaller serialisation overhead
Fast relative access
Easy to go multi-dimensional
Easy to implement atomic in-memory operations
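"Fast relative access" and "easy to go multi-dimensional" both follow from plain offset arithmetic. The sketch below assumes a row-major layout with 8-byte doubles; `byteOffset`, `flatIndex`, `fromRows` and `at` are hypothetical names, and a list stands in for the byte buffer a real backend would use.

```haskell
-- Assumed element size: one double occupies 8 bytes.
bytesPerDouble :: Int
bytesPerDouble = 8

-- Position of element (row, col) in a flat, row-major vector with `cols` columns.
flatIndex :: Int -> Int -> Int -> Int
flatIndex cols row col = row * cols + col

-- The same position expressed as a byte address.
byteOffset :: Int -> Int -> Int -> Int
byteOffset cols row col = flatIndex cols row col * bytesPerDouble

-- Flatten rows into one contiguous vector.
fromRows :: [[Double]] -> [Double]
fromRows = concat

-- Read element (row, col); a list lookup stands in for a buffer read.
at :: Int -> [Double] -> Int -> Int -> Double
at cols flat row col = flat !! flatIndex cols row col
```

With five columns, element (0, 3) sits at byte 24 and element (2, 1) at byte 88; no per-element deserialisation is needed to find either.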

P(X | blue) = (number of blue near X) / (total number of blue)
P(X | red) = (number of red near X) / (total number of red)

[[Mean(x1), Var(x1)]
 [Mean(x2), Var(x2)]
 ...
 [Mean(xn), Var(xn)]]

byte address (above each box), payloads (inside)
0         8         16
+---------+---------+
| Mean(x0)| Var(x0) |
+---------+---------+
16        24        32
+---------+---------+
| Mean(x1)| Var(x1) |
+---------+---------+
...
2n*8      (2n+1)*8
+---------+---------+
| Mean(xn)| Var(xn) |
+---------+---------+
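The interleaved layout can be sketched as follows. Everything here is illustrative: `meanVar`, `pack`, `meanAt`, `varAt` and the byte-offset helpers are hypothetical names, the variance is the population variance (an assumption), each feature is assumed to have at least one sample, and a list again stands in for a byte buffer.

```haskell
-- Mean and population variance of one feature's samples (assumes a non-empty list).
meanVar :: [Double] -> (Double, Double)
meanVar xs = (m, v)
  where
    n = fromIntegral (length xs)
    m = sum xs / n
    v = sum [(x - m) * (x - m) | x <- xs] / n

-- Pack one (mean, variance) pair per feature into a flat, interleaved vector.
pack :: [[Double]] -> [Double]
pack = concatMap (\xs -> let (m, v) = meanVar xs in [m, v])

-- Byte addresses of the mean and variance of feature i, with 8-byte doubles.
meanByteOffset, varByteOffset :: Int -> Int
meanByteOffset i = 2 * i * 8
varByteOffset i = (2 * i + 1) * 8

-- Relative access into the flat vector (indices, not bytes).
meanAt, varAt :: [Double] -> Int -> Double
meanAt flat i = flat !! (2 * i)
varAt flat i = flat !! (2 * i + 1)
```

Feature 1, for instance, has its mean at byte 16 and its variance at byte 24, matching the second row of the diagram.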

Advantages
“As compact as it gets” representation
Smaller serialisation overhead
Fast relative access

Bloom Filters
are basically long arrays/vectors

0 8
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---+---+---+---+---+---+---+---+
8 16
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---+---+---+---+---+---+---+---+
16 24
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---+---+---+---+---+---+---+---+
24 32
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---+---+---+---+---+---+---+---+
...
bit address

Advantages
64 bits per 8-byte Long
Easy to represent by the long-array using offsets, bit shifts and masks
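The "offsets, bit shifts and masks" addressing can be sketched with `Data.Bits` from `base`. This covers only the bit-vector part of a Bloom filter, not the hashing. `locate`, `emptyBits`, `setBitAt` and `testBitAt` are hypothetical names, bit indices are assumed to be non-negative, and a list of `Word64` stands in for a long array.

```haskell
import Data.Bits (setBit, shiftR, testBit, (.&.))
import Data.Word (Word64)

-- Locate a bit: which 64-bit word holds it, and which bit inside that word.
-- Shifting right by 6 divides by 64; masking with 63 takes the remainder.
locate :: Int -> (Int, Int)
locate bitIndex = (bitIndex `shiftR` 6, bitIndex .&. 63)

-- A bit vector of nLongs words, all bits cleared.
emptyBits :: Int -> [Word64]
emptyBits nLongs = replicate nLongs 0

-- Set one bit; indices past the end of the vector are ignored.
setBitAt :: Int -> [Word64] -> [Word64]
setBitAt bitIndex ws =
  [ if i == wordIx then setBit w offset else w | (i, w) <- zip [0 ..] ws ]
  where
    (wordIx, offset) = locate bitIndex

-- Test one bit; indices past the end of the vector read as unset.
testBitAt :: Int -> [Word64] -> Bool
testBitAt bitIndex ws = case drop wordIx ws of
    (w : _) -> testBit w offset
    []      -> False
  where
    (wordIx, offset) = locate bitIndex
```

Bit 70, for example, lives in word 1 at offset 6, so setting it in an empty two-word vector produces `[0, 64]`.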

Count-min sketches
are basically int matrices
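To make the "int matrix" view concrete, here is a toy count-min sketch for `Int` keys. It is a sketch under stated assumptions: `hashRow` is a made-up hash family chosen only for illustration, the depth and width are assumed to be positive, and nested lists stand in for the flat int matrix.

```haskell
-- depth rows, each holding `width` counters
type Sketch = [[Int]]

emptySketch :: Int -> Int -> Sketch
emptySketch depth width = replicate depth (replicate width 0)

-- Hypothetical per-row hash: a different linear function for every row.
hashRow :: Int -> Int -> Int -> Int
hashRow width row key =
  ((key * (2 * row + 3) + 7 * row + 1) `mod` 1000003) `mod` width

-- Increment one counter per row, chosen by that row's hash.
add :: Int -> Sketch -> Sketch
add key rows =
  [ bump (hashRow (length r) i key) r | (i, r) <- zip [0 ..] rows ]
  where
    bump j r = [ if k == j then c + 1 else c | (k, c) <- zip [0 ..] r ]

-- Estimate a key's count as the minimum over its counters.
-- Collisions can only inflate counters, so this never underestimates.
estimate :: Int -> Sketch -> Int
estimate key rows =
  minimum [ r !! hashRow (length r) i key | (i, r) <- zip [0 ..] rows ]
```

Two sketches of the same shape could be merged by adding their matrices cell by cell, which is what makes the structure fit the monoid-style aggregation described earlier.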

Histograms
are basically long vectors

Longs (counts)
byte address
0     8     16    24    32    40         n*8
+-----+-----+-----+-----+-----+----------+-----+
|  α  |  α  |  α  |  α  |  α  |   ...    |  α  |
+-----+-----+-----+-----+-----+----------+-----+

Doubles (bin start number)
byte address
0     8     16    24    32    40         n*8
+-----+-----+-----+-----+-----+----------+-----+
|  α  |  α  |  α  |  α  |  α  |   ...    |  α  |
+-----+-----+-----+-----+-----+----------+-----+
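The two parallel vectors can be sketched as one list of bin starts and one list of counts. The names `binIndex` and `record` are hypothetical, the bin starts are assumed to be sorted in ascending order, and values below the first bin start are simply ignored (an assumption; a real histogram might clamp or extend instead).

```haskell
import Data.Int (Int64)

-- Index of the bin containing x: the last bin whose start is <= x.
-- Returns Nothing when x lies below the first bin start.
binIndex :: [Double] -> Double -> Maybe Int
binIndex starts x = case takeWhile (<= x) starts of
  [] -> Nothing
  ys -> Just (length ys - 1)

-- Record one observation by incrementing the matching counter.
record :: [Double] -> Double -> [Int64] -> [Int64]
record starts x counts = case binIndex starts x of
  Nothing -> counts
  Just i  -> [ if j == i then c + 1 else c | (j, c) <- zip [0 ..] counts ]
```

With bin starts `[0, 10, 20]`, the value 15 increments the second counter and the value 25 increments the last one.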

Conclusions
Ad-hoc queries
Parallelism
Lightweight DSs representation
Optimisations and good API fits

@ifesdjeen
http://bit.ly/cassandrasummit2015
