Approximation algorithms for stream and batch processing

Copyright © 2014 Improve Digital - All Rights Reserved
Gabriele Modena
Data Scientist Improve Digital 
E: g.modena@improvedigital.com

Real Time Advertisement Technology
Media Owners Advertisers

Adtech 101
<150 msec
• Geographically distributed adserver fleet
• 200+ billion events / month
• Hundreds of TB in a Hadoop cluster

Improve Digital data platform
• Reporting & BI
– How much revenue did publisher X generate last month? Which are the top advertisers?
• Trend analysis
– Is the day-to-day traffic on site Y increasing or decreasing?
• Fraud detection
– Is the traffic legit or coming from a botnet?
• Predictive modelling
– How likely is this impression to generate a click or a conversion?
• Pattern recognition
– How are advertisers bidding and buying on inventory? Who is our audience?

Historically
• Batch pipelines
• Incremental processing
• Realtime pipelines
• Monitoring and trend analysis
Batch dataset != Realtime dataset
Batch models != Realtime models

Goals
• Write jobs once
• Unify models and analytics codebase
• Unify dataset semantics
• Experimentation

Analytics Architecture
[Diagram components: real-time log collection; brokerage (Kafka + Samza); processing (YARN + Spark + MapReduce); publish targets: Database, HDFS, Redis; push / expose]

Kafka and Samza
• Kafka (http://kafka.apache.org) as a
distributed message queue
• Topic-based
• Producers write, consumers read
• Messages are persistently stored – topics
can be re-read
• We use Samza for coordinating ingestion, ETL
and distributed stream processing

Apache Spark
• Spark (Zaharia et al. 2010)
• “Iterative” computing
• Generalization of MapReduce (Isard 2007)
• Runs atop Hadoop (YARN)
• Spark Streaming
• Break data into batches and pass it to
Spark engine (same API & data structures)

Challenges
• Conceptually everything is a stream
• Satisfy a tradeoff between
• Latency
• Memory
• Accuracy 
• On infinitely expanding datasets

Make big data small
Samples, sketches and summaries

Reservoir Sampling (Vitter, 1985)
• Hard to parallelize
• How to use samples to answer certain queries?
Count distinct? TopK?
• From an infinitely expanding dataset
• With constant memory and in a single pass
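
As a minimal sketch of the idea (the textbook "Algorithm R" form, not necessarily the variant used in production), reservoir sampling keeps a uniform sample of at most k items from a stream of unknown length, in one pass and with memory proportional to k. The names `ReservoirSketch` and `sample` are hypothetical and only serve the illustration.

```scala
import scala.collection.mutable.ArrayBuffer
import scala.util.Random

object ReservoirSketch {
  /** Returns a uniform random sample of at most k items from the stream. */
  def sample[T](stream: Iterator[T], k: Int, rng: Random = new Random()): Vector[T] = {
    val reservoir = ArrayBuffer.empty[T]
    var seen = 0L
    stream.foreach { item =>
      seen += 1
      if (reservoir.size < k) {
        // Fill phase: the first k items are always kept
        reservoir += item
      } else {
        // Draw j uniformly from [0, seen); the new item replaces slot j
        // only when j < k, which happens with probability k / seen
        val j = (rng.nextDouble() * seen).toLong
        if (j < k) reservoir(j.toInt) = item
      }
    }
    reservoir.toVector
  }
}
```

A call such as `ReservoirSketch.sample(events.iterator, 1000)` (with `events` standing in for any collection) returns at most 1000 items. The single shared counter `seen` is what makes the algorithm awkward to parallelize.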

Cardinality estimation (count distinct)
How many users are visiting a site?

Claim
The cardinality of a multiset of
uniformly-distributed random
numbers can be estimated by
calculating the maximum number
of leading zeros in the binary
representation of each number in
the set.

HyperLogLog (Flajolet et al. 2008)
Intuitively
1. Apply a hash function to each element and take the binary representation of the output
2. If the maximum number of leading zeros observed is n, an estimate for the number of distinct elements in the set is 2^n
3. Account for variance by averaging over subsets
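
The three steps can be sketched in plain Scala as follows. This is a simplified illustration, not Algebird's implementation: it assumes a 32-bit MurmurHash3 from the Scala standard library, uses the first p bits of the hash to pick a register (the "subsets"), and combines registers with the harmonic mean and small-range correction described in the HyperLogLog paper. The object name and the default p = 10 are hypothetical choices.

```scala
import scala.util.hashing.MurmurHash3

object LeadingZerosSketch {
  /** Rough HyperLogLog-style estimate of the number of distinct strings (7 <= p < 32). */
  def estimate(items: Iterator[String], p: Int = 10): Double = {
    val m = 1 << p
    val registers = new Array[Int](m)
    items.foreach { s =>
      val h = MurmurHash3.stringHash(s)
      val bucket = h >>> (32 - p)   // first p bits choose a register
      val rest = h << p             // remaining bits, left-aligned
      // Rank = leading zeros of the remaining bits + 1, capped at 32 - p + 1
      val rank = math.min(Integer.numberOfLeadingZeros(rest), 32 - p) + 1
      registers(bucket) = math.max(registers(bucket), rank)
    }
    // Bias-correction constant from the paper, valid for m >= 128
    val alpha = 0.7213 / (1.0 + 1.079 / m)
    val raw = alpha * m * m / registers.map(r => math.pow(2.0, -r)).sum
    val empty = registers.count(_ == 0)
    // Small-range correction: fall back to linear counting while registers are still empty
    if (raw <= 2.5 * m && empty > 0) m * math.log(m.toDouble / empty) else raw
  }
}
```

Duplicates never change a register, so the memory used depends only on p, not on the length of the stream.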

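HyperLogLog (with Spark + Algebird)
import com.twitter.algebird.HyperLogLogMonoid

// 12 bits => 2^12 registers
val hll = new HyperLogLogMonoid(12)

// users is assumed to be a DStream of user ids:
// build one HLL per id, then merge them within each batch
val approxUsers = users.mapPartitions(ids => ids.map(uuid =>
  hll(uuid.getBytes))).reduce(_ + _)

// Fold each batch's HLL into a running global estimate
var h = hll.zero
approxUsers.foreachRDD(rdd => {
  if (rdd.count() != 0) {
    val partial = rdd.first()
    h += partial
  }
})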

HyperLogLog (< 2% error rate in 15kB)
[Figure: count and memory, exact vs. approximate]
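
As a quick sanity check on the error figure, the HyperLogLog paper gives a standard error of roughly 1.04 / sqrt(m) for m registers. The helper below is a hypothetical illustration that evaluates this formula for the 12 bits used in the Algebird example; it says nothing about the memory figure.

```scala
object HllError {
  /** Approximate standard error of HyperLogLog with 2^bits registers. */
  def standardError(bits: Int): Double =
    1.04 / math.sqrt((1L << bits).toDouble)
}
```

For 12 bits this gives 1.04 / 64 = 0.01625, about 1.6%, in line with the "< 2%" on the slide.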

Frequency estimation
Which are the top 10 most visited sites (out of a few million)?

Count Min Sketch
(Cormode and Muthukrishnan, 2005)
It’s the hashing trick!
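
To make the "hashing trick" concrete, here is a minimal Count-Min Sketch in plain Scala. It is a sketch under stated assumptions, not Algebird's implementation: the class name `TinyCMS` is hypothetical, the row hashes are derived from the standard library's MurmurHash3, and the table is sized with the usual width = ceil(e / eps) and depth = ceil(ln(1 / delta)).

```scala
import scala.util.hashing.MurmurHash3

/** A small Count-Min Sketch over Long keys; estimates never undercount. */
class TinyCMS(eps: Double, delta: Double, seed: Int = 1) {
  val width: Int = math.ceil(math.E / eps).toInt           // columns per row
  val depth: Int = math.ceil(math.log(1.0 / delta)).toInt  // rows, one hash function each
  private val table = Array.ofDim[Long](depth, width)

  // Column of `key` in `row`: hash (seed, row, key), then map to [0, width)
  private def column(row: Int, key: Long): Int = {
    val h = MurmurHash3.productHash((seed, row, key))
    ((h % width) + width) % width
  }

  /** Adds `count` occurrences of `key` to every row. */
  def add(key: Long, count: Long = 1L): Unit =
    for (row <- 0 until depth) table(row)(column(row, key)) += count

  /** Collisions only inflate cells, so the minimum across rows is the tightest estimate. */
  def estimate(key: Long): Long =
    (0 until depth).map(row => table(row)(column(row, key))).min
}
```

With eps = 0.01 and delta = 1E-3 this allocates 7 rows of 272 counters, regardless of how many distinct keys are seen.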

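CMS (with Spark + Algebird)
import com.twitter.algebird.CountMinSketchMonoid

val eps = 0.01    // error bound, as a fraction of the total count
val delta = 1E-3  // probability of exceeding that bound
val seed = 1
val perc = 0.003  // heavy-hitter threshold, as a fraction of the total count

// publishers is assumed to be a DStream of publisher ids:
// build one sketch per id, then merge them within each batch
val approxImpressions = publishers.mapPartitions(publisher => {
  // CountMinSketchMonoid takes eps before delta
  val cms = new CountMinSketchMonoid(eps, delta, seed, perc)
  publisher.map(publisher_id => cms.create(publisher_id.toLong))
}).reduce(_ ++ _)

var globalCMS = new CountMinSketchMonoid(eps, delta, seed, perc).zero
approxImpressions.foreachRDD(rdd => {
  if (rdd.count() != 0) {
    val partial = rdd.first()
    globalCMS ++= partial
    // Rank the heavy hitters by estimated frequency and keep the top 5
    val globalTopK = globalCMS.heavyHitters.map(id => (id,
      globalCMS.frequency(id).estimate)).toSeq.sortBy(_._2).reverse.slice(0, 5)
  }
})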

CMS results
[Figure: exact vs. approximate results]

Learning from data

Using sketches
• Linear Regression
– OLS + SGD on batches of data
– Recursive Least Squares with Forgetting (Vahidi et al. 2005)
• Streaming k-means (Ailon et al. 2009, Shindler et al. 2011, Ostrovsky et al. 2012)
– Single iteration-to-convergence
– Use sketches to reduce dimensionality (k log N centroids)
– Mini-batch updates + forgetfulness
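
To illustrate the "forgetting" idea for regression, here is a minimal sketch of recursive least squares with an exponential forgetting factor, fitting y ≈ w0 + w1 * x one observation at a time. It is the textbook single-factor form, which may differ from the variant in Vahidi et al.; the class name `ForgetfulRLS` and its default parameters are assumptions made for the example.

```scala
/** Recursive least squares for y ≈ w0 + w1 * x, with forgetting factor lambda in (0, 1]. */
class ForgetfulRLS(lambda: Double = 0.99, initialVariance: Double = 1000.0) {
  private var w0 = 0.0
  private var w1 = 0.0
  // P is a symmetric 2x2 matrix tracking the inverse of the discounted input covariance
  private var p00 = initialVariance
  private var p01 = 0.0
  private var p11 = initialVariance

  def weights: (Double, Double) = (w0, w1)
  def predict(x: Double): Double = w0 + w1 * x

  /** Folds one observation (x, y) into the model in constant time and memory. */
  def update(x: Double, y: Double): Unit = {
    // P * phi, where phi = (1, x) is the feature vector
    val px0 = p00 + p01 * x
    val px1 = p01 + p11 * x
    val denom = lambda + px0 + x * px1   // lambda + phi^T P phi
    val k0 = px0 / denom                 // gain vector k = P phi / denom
    val k1 = px1 / denom
    val err = y - predict(x)             // prediction error before the update
    w0 += k0 * err
    w1 += k1 * err
    // P <- (P - k (P phi)^T) / lambda; dividing by lambda discounts old data
    val n00 = (p00 - k0 * px0) / lambda
    val n01 = (p01 - k0 * px1) / lambda
    val n11 = (p11 - k1 * px1) / lambda
    p00 = n00; p01 = n01; p11 = n11
  }
}
```

With lambda = 1 this reduces to ordinary recursive least squares; smaller values make the model track a drifting relationship faster, at the cost of noisier estimates.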

Conclusion
• Streaming is part of the broader system
• Approximation can help us scale both streaming and batch loads
– Make “big data” small
– Unification
• Data collection and distribution is key
– Publishing results follows
• Large scale analytics = Architecture + Algos + Data Structures
