@ifesdjeen
Cassandra
Monitoring
Precision
is not the same as
Semantics
is not the same as
Anomaly detection
Do you see the elephant
being swallowed by the snake?
Agenda
Ad-hoc queries
Aggregations
Fast
Machine Learning
parallel queries
Step 1
+---------------+---------------+
| timestamp | sequenceId |
+---------------+---------------+
Used to avoid timestamp resolution collisions
To ensure sub-resolution order
Snapshot the data on overflow or timeout
Ensures idempotence
Sequence ID
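The role of the sequence ID can be sketched as follows. This is a minimal illustration, not the talk's implementation: `assignSequenceIds` and the type aliases are hypothetical names, and the input is assumed to arrive in non-decreasing timestamp order. Events sharing a timestamp get increasing IDs, so the `(timestamp, sequenceId)` pair is unique and sorts in arrival order.

```haskell
-- Hypothetical aliases: a coarse timestamp (for example milliseconds)
-- and a per-timestamp counter.
type Timestamp = Int
type SequenceId = Int

-- | Attach a sequence id to each event. Events that share a timestamp get
-- increasing ids (0, 1, 2, ...) in arrival order; the counter resets when
-- the timestamp advances. Assumes non-decreasing timestamps in the input.
assignSequenceIds :: [(Timestamp, a)] -> [((Timestamp, SequenceId), a)]
assignSequenceIds = go Nothing
  where
    go _ [] = []
    go prev ((ts, x) : rest) =
      let sq = case prev of
                 Just (prevTs, prevSq) | prevTs == ts -> prevSq + 1
                 _                                    -> 0
      in ((ts, sq), x) : go (Just (ts, sq)) rest
```

Because the composite key is deterministic for a given batch, writing the same batch twice would overwrite the same keys rather than duplicate rows, which is one way to read "ensures idempotence".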
Fighting
Dispersion
[Figure: timeline ts1 to ts13]
Range Tables
Full Table Scan
[Figure: timeline ts1 to ts13 with Start and End markers]
Open Range
[Figure: timeline ts1 to ts13 with Start and End markers]
“Between” Range
[Figure: timeline ts1 to ts13 with Start and End markers]
(rich query API)
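The three range shapes can be sketched over an in-memory list sorted by timestamp. The `Range` type and `select` function are hypothetical names chosen for illustration; the slides do not show the actual query API, and a real table would push these bounds down to the database instead of filtering in memory.

```haskell
-- | Hypothetical query bounds over a timestamp-ordered table.
data Range ts
  = FullScan       -- every row
  | From ts        -- open range: start only
  | Until ts       -- open range: end only
  | Between ts ts  -- both bounds, inclusive

-- | Select rows from a list that is already sorted by timestamp.
-- Rows are (timestamp, payload) pairs.
select :: Ord ts => Range ts -> [(ts, a)] -> [(ts, a)]
select FullScan      = id
select (From start)  = dropWhile ((< start) . fst)
select (Until end)   = takeWhile ((<= end) . fst)
select (Between s e) = takeWhile ((<= e) . fst) . dropWhile ((< s) . fst)
```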
Step 2
add some algebra
Stream Fusion
for
rich ad-hoc queries
What even is
Stream Fusion
map
filter
reduce
single step
mapFilterReduce
-- requires {-# LANGUAGE ExistentialQuantification #-}
data Step a cursor = Yield a !cursor
                   | Skip !cursor
                   | Done
data Stream a =
  forall cursor. Stream (cursor -> Step a cursor) cursor
Stream Beginning:
reading from the DB
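map
maps :: (a -> b) -> Stream a -> Stream b
maps f (Stream next cursor0) = Stream next' cursor0
  where
    next' cursor = case next cursor of
      Yield x cursor' -> Yield (f x) cursor'  -- apply f to the value, not the cursor
      Skip cursor'    -> Skip cursor'
      Done            -> Done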
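filter
filters :: (a -> Bool) -> Stream a -> Stream a
filters p (Stream next cursor0) = Stream next' cursor0
  where
    next' cursor = case next cursor of
      Yield x cursor'
        | p x       -> Yield x cursor'
        | otherwise -> Skip cursor'  -- rejected values become skips
      Skip cursor'  -> Skip cursor'
      Done          -> Done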
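reduce/fold
foldls :: (Monoid acc) => (acc -> a -> acc) -> acc -> Stream a -> acc
foldls f z (Stream next cursor0) = loop z cursor0
  where
    loop acc cursor = case next cursor of
      Yield x cursor' -> loop (f acc x) cursor'
      Skip cursor'    -> loop acc cursor'
      Done            -> acc  -- return the accumulator, not the initial value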
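Putting the pieces together, the following self-contained sketch shows map, filter and fold composed into one pass. A plain list stands in for rows read from the DB (`fromList` and `mapFilterReduce` are names introduced here for illustration), and the `Monoid` constraint on the fold is dropped because this fold does not need it.

```haskell
{-# LANGUAGE ExistentialQuantification #-}

-- One step of a stream: produce a value, skip, or finish.
data Step a cursor = Yield a !cursor | Skip !cursor | Done

-- A stream is a step function plus its current cursor.
data Stream a = forall cursor. Stream (cursor -> Step a cursor) cursor

-- Stream beginning: a list stands in for a DB cursor (assumption).
fromList :: [a] -> Stream a
fromList = Stream next
  where
    next []       = Done
    next (x : xs) = Yield x xs

maps :: (a -> b) -> Stream a -> Stream b
maps f (Stream next cursor0) = Stream next' cursor0
  where
    next' cursor = case next cursor of
      Yield x cursor' -> Yield (f x) cursor'
      Skip cursor'    -> Skip cursor'
      Done            -> Done

filters :: (a -> Bool) -> Stream a -> Stream a
filters p (Stream next cursor0) = Stream next' cursor0
  where
    next' cursor = case next cursor of
      Yield x cursor'
        | p x       -> Yield x cursor'
        | otherwise -> Skip cursor'
      Skip cursor'  -> Skip cursor'
      Done          -> Done

-- Strict left fold: the only loop in the whole pipeline.
foldls :: (acc -> a -> acc) -> acc -> Stream a -> acc
foldls f z (Stream next cursor0) = loop z cursor0
  where
    loop acc cursor = acc `seq` case next cursor of
      Yield x cursor' -> loop (f acc x) cursor'
      Skip cursor'    -> loop acc cursor'
      Done            -> acc

-- Triple each value, keep the even results, and sum them in a single pass.
mapFilterReduce :: [Int] -> Int
mapFilterReduce = foldls (+) 0 . filters even . maps (* 3) . fromList
```

No intermediate collection is built between the stages: each element flows through all three step functions before the next one is read.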
Append
class Monoid a where
  mempty :: a
  -- ^ Identity of 'mappend'
  mappend :: a -> a -> a
  -- ^ An associative operation
-- requires {-# LANGUAGE MultiParamTypeClasses #-}
class (Monoid intermediate) =>
      Aggregate intermediate end
  where
  combine :: intermediate -> end
Combine
data Count = Count Int
instance Monoid Count where
  mempty = Count 0
  mappend (Count a) (Count b) = Count $ a + b
instance Aggregate Count Int where
  combine (Count a) = a
Count Example
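The reason the intermediate value is a monoid can be sketched as follows: partial aggregates computed per partition can be merged in any grouping, so partitions can be processed in parallel. This sketch uses the `Monoid` class from `base` (which also requires a `Semigroup` instance) instead of the slide's own class. The functional dependency, the `Mean` aggregate and the `aggregate` helper are assumptions added for illustration.

```haskell
{-# LANGUAGE MultiParamTypeClasses, FunctionalDependencies #-}

-- Partial aggregates form a monoid; 'combine' turns the merged
-- intermediate value into the final answer.
class Monoid intermediate => Aggregate intermediate end | intermediate -> end where
  combine :: intermediate -> end

newtype Count = Count Int

instance Semigroup Count where
  Count a <> Count b = Count (a + b)
instance Monoid Count where
  mempty = Count 0
instance Aggregate Count Int where
  combine (Count a) = a

-- Hypothetical second aggregate: a mean keeps (sum, count) as its
-- intermediate state and divides only at the end.
data Mean = Mean Double Int

instance Semigroup Mean where
  Mean s1 n1 <> Mean s2 n2 = Mean (s1 + s2) (n1 + n2)
instance Monoid Mean where
  mempty = Mean 0 0
instance Aggregate Mean Double where
  combine (Mean _ 0) = 0
  combine (Mean s n) = s / fromIntegral n

-- Aggregate each partition independently, then merge the partial results.
aggregate :: Aggregate i e => (a -> i) -> [[a]] -> e
aggregate inject partitions =
  combine (mconcat (map (foldMap inject) partitions))
```

Here `map (foldMap inject)` is the step that could run once per partition; `mconcat` is the cheap merge of the partial results.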
add some ML
Step 3
Storing Models
Support Vector Machines
Hyperplane
α·x - φ = 1
[ α1 α2 α3 ... αn ] φ
Option 1:
list<double>
CREATE TABLE support_vectors (
  path varchar,
  alpha list<double>,
  phi int,
  PRIMARY KEY (path)
);
Problems
High deserialisation overhead
Need to add PK specifiers for multiple SVs
Alternative:
blob & byte buffers
Vector Representation
byte address  0    8    16   24   32   40        n*8
              +----+----+----+----+----+---------+----+
              | α  | α  | α  | α  | α  |   ...   | α  |
              +----+----+----+----+----+---------+----+
points          0    1    2    3    4              n
Matrix Representation
          0    8    16   24   32   40        n*8
          +----+----+----+----+----+---------+----+
          | α  | α  | α  | α  | α  |   ...   | α  |
          +----+----+----+----+----+---------+----+
            00   01   02   03   04             0n
n*8 +     0    8    16   24   32   40        n*8
          +----+----+----+----+----+---------+----+
          | α  | α  | α  | α  | α  |   ...   | α  |
          +----+----+----+----+----+---------+----+
            10   11   12   13   14             1n
m*n*8 +   0    8    16   24   32   40        n*8
          +----+----+----+----+----+---------+----+
          | α  | α  | α  | α  | α  |   ...   | α  |
          +----+----+----+----+----+---------+----+
            m0   m1   m2   m3   m4             mn
Advantages
“As compact as it gets” representation
Smaller serialisation overhead
Fast relative access
Easy to go multi-dimensional
Easy to implement atomic in-memory operations
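The packed layout can be sketched with a strict `ByteString` standing in for the blob column. This is an illustration under assumptions: big-endian 8-byte doubles, row-major matrices, and no bounds checking. `packDoubles`, `doubleAt` and `matrixAt` are hypothetical helper names.

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Lazy as BL
import Data.ByteString.Builder (doubleBE, toLazyByteString)
import Data.Bits (shiftL, (.|.))
import Data.Word (Word64, Word8)
import GHC.Float (castWord64ToDouble)

-- | Pack doubles back to back, 8 bytes each, big-endian.
packDoubles :: [Double] -> BS.ByteString
packDoubles = BL.toStrict . toLazyByteString . foldMap doubleBE

-- | Read the double at element index i, i.e. byte offset i * 8.
-- No bounds checking: the index is assumed to be valid.
doubleAt :: BS.ByteString -> Int -> Double
doubleAt bytes i = castWord64ToDouble (BS.foldl' step 0 slice)
  where
    slice = BS.take 8 (BS.drop (i * 8) bytes)
    step :: Word64 -> Word8 -> Word64
    step acc b = (acc `shiftL` 8) .|. fromIntegral b

-- | Row-major matrix access: element (row, col) of a matrix with
-- nCols columns lives at element index row * nCols + col.
matrixAt :: Int -> BS.ByteString -> Int -> Int -> Double
matrixAt nCols bytes row col = doubleAt bytes (row * nCols + col)
```

Relative access is a multiplication and an 8-byte read, with no per-element deserialisation of the rest of the vector.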
Bayesian Classifiers
P(X|blue) = (Number of Blue near X) / (Total number of Blue)
P(X|red) = (Number of Red near X) / (Total number of Red)
[[Mean(x1), Var(x1)]
 [Mean(x2), Var(x2)]
 ...
 [Mean(xn), Var(xn)]]
byte address  0         8         16
              +---------+---------+
payloads      | Mean(x0)| Var(x0) |
              +---------+---------+
byte address  16        24        32
              +---------+---------+
payloads      | Mean(x1)| Var(x1) |
              +---------+---------+
              ...
byte address  2n*8      (2n+1)*8
              +---------+---------+
payloads      | Mean(xn)| Var(xn) |
              +---------+---------+
Advantages
“As compact as it gets” representation
Smaller serialisation overhead
Fast relative access
Easy to implement atomic in-memory operations
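One way such a flat mean/variance payload could be used for classification is sketched below. The slides do not say which likelihood is used. This sketch assumes a Gaussian naive Bayes model with equal class priors, and all names (`Model`, `pairs`, `logLikelihood`, `classify`) are hypothetical.

```haskell
import Data.List (maximumBy)
import Data.Ord (comparing)

-- | Per-class model stored flat as [mean0, var0, mean1, var1, ...].
type Model = [Double]

-- | Split the flat payload back into (mean, variance) pairs.
pairs :: [Double] -> [(Double, Double)]
pairs (m : v : rest) = (m, v) : pairs rest
pairs _              = []

-- | Log density of a univariate Gaussian (assumed likelihood).
logGaussian :: Double -> (Double, Double) -> Double
logGaussian x (mean, var) =
  negate (log (2 * pi * var)) / 2 - (x - mean) ^ (2 :: Int) / (2 * var)

-- | Sum of per-feature log densities: features are treated as independent.
logLikelihood :: Model -> [Double] -> Double
logLikelihood model xs = sum (zipWith logGaussian xs (pairs model))

-- | Pick the class with the highest likelihood. Equal priors are assumed
-- and the list of models must be non-empty.
classify :: [(String, Model)] -> [Double] -> String
classify models xs =
  fst (maximumBy (comparing (\(_, m) -> logLikelihood m xs)) models)
```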
make it rocket-fast
Step 4
Approximate Data
Structures
Bloom Filters
are basically long arrays/vectors
BitSet
0 8
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---+---+---+---+---+---+---+---+
8 16
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---+---+---+---+---+---+---+---+
16 24
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---+---+---+---+---+---+---+---+
24 32
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
+---+---+---+---+---+---+---+---+
...
bit address
Advantages
64 bits per 8-byte Long
Easy to represent by the long-array using offsets, bit shifts and masks
Easy to implement atomic in-memory operations
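The offset, shift and mask arithmetic can be sketched as below. A list of `Word64` stands in for the long array to keep the example dependency-free, so it shows the addressing only, not the atomic in-memory updates. `hashWith` is a hypothetical toy hash and is not suitable for production use.

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word64)

-- | Bit i lives in word (i `shiftR` 6) at position (i .&. 63).
newtype BitSet = BitSet [Word64]

emptyBits :: Int -> BitSet
emptyBits nWords = BitSet (replicate nWords 0)

setBitAt :: Int -> BitSet -> BitSet
setBitAt i (BitSet ws) = BitSet (before ++ rest')
  where
    (before, rest) = splitAt (i `shiftR` 6) ws
    mask = 1 `shiftL` (i .&. 63)
    rest' = case rest of
      (w : after) -> (w .|. mask) : after
      []          -> []

getBitAt :: Int -> BitSet -> Bool
getBitAt i (BitSet ws) = case drop (i `shiftR` 6) ws of
  (w : _) -> (w .&. (1 `shiftL` (i .&. 63))) /= 0
  []      -> False

-- | Hypothetical seeded string hash, for illustration only.
hashWith :: Int -> String -> Int
hashWith seed = foldl (\h c -> (h * 31 + ord c + seed) `mod` 1000003) seed

-- | Bloom filter: total number of bits plus the backing bit set.
data Bloom = Bloom Int BitSet

emptyBloom :: Int -> Bloom
emptyBloom nWords = Bloom (nWords * 64) (emptyBits nWords)

-- | Bit positions for a key, one per hash seed.
positions :: Int -> String -> [Int]
positions nBits key = [hashWith seed key `mod` nBits | seed <- [17, 101, 257]]

insert :: String -> Bloom -> Bloom
insert key (Bloom n bits) = Bloom n (foldr setBitAt bits (positions n key))

-- | False means definitely absent; True means possibly present.
mightContain :: String -> Bloom -> Bool
mightContain key (Bloom n bits) = all (`getBitAt` bits) (positions n key)
```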
Count-min sketches
are basically int matrices
          0    8    16   24   32   40        n*8
          +----+----+----+----+----+---------+----+
          | α  | α  | α  | α  | α  |   ...   | α  |
          +----+----+----+----+----+---------+----+
            00   01   02   03   04             0n
n*8 +     0    8    16   24   32   40        n*8
          +----+----+----+----+----+---------+----+
          | α  | α  | α  | α  | α  |   ...   | α  |
          +----+----+----+----+----+---------+----+
            10   11   12   13   14             1n
m*n*8 +   0    8    16   24   32   40        n*8
          +----+----+----+----+----+---------+----+
          | α  | α  | α  | α  | α  |   ...   | α  |
          +----+----+----+----+----+---------+----+
            m0   m1   m2   m3   m4             mn
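A count-min sketch over such a flat, row-major counter matrix might look like the sketch below. The record fields, `hashWith` (a toy hash) and the list-based storage are assumptions for illustration; a real implementation would update a byte buffer or array in place.

```haskell
import Data.Char (ord)

-- | depth rows by width columns of counters, stored row-major in one
-- flat list. Depth is assumed to be at least 1.
data Sketch = Sketch { depth :: Int, width :: Int, cells :: [Int] }

emptySketch :: Int -> Int -> Sketch
emptySketch d w = Sketch d w (replicate (d * w) 0)

-- | Hypothetical seeded string hash, for illustration only.
hashWith :: Int -> String -> Int
hashWith seed = foldl (\h c -> (h * 31 + ord c + seed) `mod` 1000003) seed

-- | Flat indices touched by a key: one column per row.
indices :: Sketch -> String -> [Int]
indices (Sketch d w _) key =
  [row * w + (hashWith (row + 1) key `mod` w) | row <- [0 .. d - 1]]

-- | Increment one counter in every row.
add :: String -> Sketch -> Sketch
add key sk =
  sk { cells = [if i `elem` hit then c + 1 else c | (i, c) <- zip [0 ..] (cells sk)] }
  where
    hit = indices sk key

-- | The minimum over the rows never underestimates the true count.
estimate :: String -> Sketch -> Int
estimate key sk = minimum [cells sk !! i | i <- indices sk key]
```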
Histograms
are basically long vectors
Longs (counts)
byte address  0    8    16   24   32   40        n*8
              +----+----+----+----+----+---------+----+
              | α  | α  | α  | α  | α  |   ...   | α  |
              +----+----+----+----+----+---------+----+
                00   01   02   03   04             0n
Doubles (bin start number)
byte address  0    8    16   24   32   40        n*8
              +----+----+----+----+----+---------+----+
              | α  | α  | α  | α  | α  |   ...   | α  |
              +----+----+----+----+----+---------+----+
                10   11   12   13   14             1n
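The two parallel vectors can be sketched as a record holding bin starts and counts. Fixed-width bins, the clamping of large values into the last bin, and the names `mkHistogram`, `binIndex` and `record` are assumptions made for this illustration.

```haskell
import Data.Int (Int64)

-- | Two parallel vectors: bin start values (doubles) and counts (longs).
data Histogram = Histogram { binStarts :: [Double], counts :: [Int64] }

-- | n fixed-width bins of width w, the first one starting at lo.
mkHistogram :: Double -> Double -> Int -> Histogram
mkHistogram lo w n =
  Histogram [lo + w * fromIntegral i | i <- [0 .. n - 1]] (replicate n 0)

-- | Index of the last bin whose start is <= x. Values below the first
-- bin have no bin; values past the last start land in the last bin.
binIndex :: Histogram -> Double -> Maybe Int
binIndex h x = case length (takeWhile (<= x) (binStarts h)) of
  0 -> Nothing
  k -> Just (k - 1)

-- | Count one observation.
record :: Double -> Histogram -> Histogram
record x h = case binIndex h x of
  Nothing -> h
  Just i  ->
    h { counts = [if j == i then c + 1 else c | (j, c) <- zip [0 ..] (counts h)] }
```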
Conclusions
Ad-hoc queries
Parallelism
Lightweight DS representations
Optimisations and good API fits
@ifesdjeen
http://bit.ly/cassandrasummit2015

codecentric AG: Using Cassandra and Clojure for Data Crunching backends