Data Stream Algorithms
in Storm and R
Radek Maciaszek
Who Am I?
 Radek Maciaszek
 Consulting at DataMine Lab (www.dataminelab.com) - Data mining,
business intelligence and data warehouse consultancy.
 Data scientist at a hedge fund in London
 BSc Computer Science, MSc in Cognitive and Decision Sciences, MSc
in Bioinformatics
 Worked with many companies on Big Data and real-time processing
projects, including Orange, ad4game, Unanimis, SkimLinks,
CognitiveMatch, OpenX and many others.
Agenda
• Why streaming algorithms?
• Streaming algorithms crash course
• Apache Storm
• Storm + R – for more traditional statistics
• Use Cases
Data Explosion
• Exponential growth of information
[IDC, 2012] International Data Corporation, a market research company
Data, data everywhere [Economist]
• “In 2013, the available storage capacity could hold 33% of all
data. By 2020, it will be able to store less than 15%” [IDC, 2014]
Data Streams – crash course
• Reasons to use data stream processing
• Data doesn’t fit into available memory and/or disk
• (Near) real-time data processing
• Scalability, cloud processing
• Examples
• Network traffic (ISP)
• Fraud detection
• Web traffic (e.g. online advertising)
Use Case – Dynamic Sampling
• OpenX – the ad server
• Customers with tens of millions of ad views per hour
• Challenge
• Create samples for statistical analysis, e.g. A/B testing, ANOVA, etc.
• How to sample data in real time from an input stream of unknown size
• Solution
• Reservoir Sampling – draws a fixed-size sample from a stream whose
length is not known in advance
Data Streaming algorithms
• Sampling
• Use the statistic of a sample to estimate the statistic of the population. The
bigger the sample, the better the estimate.
• Reservoir Sampling – sample a population without knowing its size.
• Algorithm (see the R sketch below):
• Store the first n elements in the reservoir.
• For each subsequent k-th element of the stream, replace a random element
of the reservoir with it with probability n/k (a decreasing probability)
Source: Maldonado, et al; 2011
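A minimal R sketch of the algorithm above; the stream is passed in as a vector for
simplicity, whereas a real implementation would consume elements one at a time.

# Reservoir sampling: uniform sample of n elements from a stream of unknown length.
reservoir_sample <- function(stream, n) {
  reservoir <- stream[seq_len(n)]                 # store the first n elements
  for (k in seq_along(stream)[-seq_len(n)]) {     # k-th element, k > n
    if (runif(1) < n / k) {                       # keep it with probability n/k
      reservoir[sample.int(n, 1)] <- stream[k]    # overwrite a random reservoir slot
    }
  }
  reservoir
}

set.seed(1)
reservoir_sample(1:100000, n = 10)                # each element equally likely to be kept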
Moving average
• Example: online update of the moving average of a time series at
time “t”
• Where M is the window size of the moving average
• At time “t”, predict “t+1” by removing the element at “t-M” and
adding the element at “t” (see the sketch below)
• Requires only the last M elements to be kept in memory
• There are many more online algorithms: mean, variance, regression,
percentile
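A small R sketch of the online update, assuming a window of size M: the mean at
time “t” is the previous mean plus (x_t - x_{t-M}) / M, so only the last M
observations need to be kept in memory.

# Online moving average over a window of size M.
make_moving_average <- function(M) {
  window <- numeric(0)
  avg <- NA_real_
  function(x_t) {
    window <<- c(window, x_t)
    if (length(window) < M) return(NA_real_)   # not enough history yet
    if (length(window) == M) {
      avg <<- mean(window)                     # first full window
    } else {
      avg <<- avg + (x_t - window[1]) / M      # add the newest, drop the oldest
      window <<- window[-1]                    # keep only the last M elements
    }
    avg
  }
}

ma <- make_moving_average(M = 3)
sapply(c(1, 2, 3, 4, 5), ma)                   # NA NA 2 3 4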
Use Case - Counting Unique Users
• Large UK ad-network
• Challenge – calculate the number of unique visitors, one of the
most important metrics in online advertising
• Hadoop MapReduce worked, but took a long time and used too
much memory.
• Better solutions:
• Cardinality estimation on stream data, e.g. HyperLogLog algorithm
• A highly efficient algorithm for counting the number of distinct elements
• Many other use cases:
• ISP traffic usage estimates
• Cardinality estimation in DB query optimisation
• Algorithm and implementation details
Probabilistic counting
• Transform input data into i.i.d. (independent and identically distributed)
uniform random bits of information
• Hash(x) = bit1 bit2 …
• Where P(bit1 = 1) = P(bit2 = 1) = 1/2
• A hash starting 1xxx… has probability 1/2, so observing one suggests n >= 2
A hash starting 11xx… has probability 1/4, so observing one suggests n >= 4
A hash starting 111x… has probability 1/8, so observing one suggests n >= 8
In general, a prefix of k ones has probability 1/2^k and suggests n >= 2^k
• Record the biggest such prefix seen, i.e. p = position of the first “0”
• Flajolet (1983) estimated the bias of this estimator and derived a
correction constant (≈ 0.77351)
• 1983 - Flajolet & Martin (the first streaming algorithm)
[Figure: distribution of unhashed vs. hashed values]
Source: http://git.io/veCtc
Probabilistic counting - algorithm
• Algorithm: mark each observed run length in a bitmap; p is the position of
the first zero in the bitmap (counting from 0)
• Estimate the size using: n ≈ 2^p / 0.77351
• R proof-of-concept implementation: http://git.io/ve8Ia
• Example: see the sketch below
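A minimal R sketch of the idea, separate from the linked proof of concept: uniform
random numbers stand in for hashed values, each element marks its run length in a
shared bitmap, and the position of the first unmarked slot gives the estimate. A
single sketch is noisy; Flajolet & Martin average several to reduce the variance.

# Toy probabilistic counting (Flajolet-Martin style) with simulated hash values.
set.seed(7)
fm_estimate <- function(n_distinct, L = 32) {
  bitmap <- logical(L)
  u <- runif(n_distinct)                # stand-in for uniform hash values in (0, 1)
  runs <- pmin(floor(-log2(u)), L - 1)  # run lengths: P(run = k) = 1/2^(k+1)
  bitmap[runs + 1] <- TRUE              # mark every run length observed
  p <- match(FALSE, bitmap) - 1         # position of the first "0", counting from 0
  2^p / 0.77351                         # bias-corrected estimate
}
fm_estimate(1e5)                        # rough estimate of 100,000 (single sketch, high variance)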
Can we do better?
• LogLog – instead of keeping a bitmap of all observed run lengths, keep
only the largest one
• This takes only log log n bits, but at the cost of lost precision
• SuperLogLog – discard the largest per-bucket values (keep roughly the
smallest 70%) before averaging; more complex analysis
• HyperLogLog – harmonic mean of the per-bucket estimates (see the toy
sketch below)
• Fast, cheap and 98% correct
• What if you want more traditional statistics?
Reference: Flajolet, Fusy et al., 2007
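A toy R sketch of the HyperLogLog estimator under simplifying assumptions:
simulated uniform hashes instead of a real hash function, m = 2^b buckets, and no
small- or large-range corrections. It illustrates the harmonic-mean idea rather
than a production implementation.

# Toy HyperLogLog: m = 2^b buckets, harmonic mean over per-bucket run lengths.
set.seed(42)
toy_hll <- function(n_distinct, b = 12) {
  m <- 2^b
  M <- integer(m)                      # largest run length seen in each bucket
  u <- runif(n_distinct)               # stand-in for uniform hash values
  j <- floor(u * m) + 1                # bucket index from the first b "bits"
  w <- (u * m) %% 1                    # remaining "bits" as a uniform fraction
  rho <- floor(-log2(w)) + 1           # position of the first 1-bit in w
  for (i in seq_len(n_distinct)) M[j[i]] <- max(M[j[i]], rho[i])
  alpha <- 0.7213 / (1 + 1.079 / m)    # standard HLL constant for m >= 128
  alpha * m^2 / sum(2^(-M))            # harmonic-mean (raw) estimate
}
toy_hll(1e6)                           # typically within a few percent of 1,000,000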
R – Open Source Statistics
• Open source = low cost of adoption. Useful for prototyping.
• Large global community - more than 2.5 million users
• ~5,000 free, open-source packages
• Extensively used for modelling and visualisations
Source: Rexer Analytics
Use Case – Real-time Machine Learning
• Gaming ad-network
• 150m+ ad impressions per day
• Lambda architecture (fast and batch layers): Storm used in
parallel with Hadoop
• Challenge
• Make a real-time decision on which ad to display – the old system only
updated its decisions every hour
• Use a sophisticated statistical environment for A/B testing
• Solution
• Beta Distribution to compare effectiveness of the ads
• Use Storm to do real-time statistics
Use Case – Beta Distributions
• Comparing two ads:
• Ratio: CTR = Clicks / Views
• Wolfram Alpha: Beta distribution (5, 30 - 5), i.e. Beta(clicks, views - clicks)
Source: Wolfram Alpha
Beta distributions prototyping – the R code
• Bootstrapping in R
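A small R sketch along the lines of this slide: draw CTR samples from each ad's
Beta(clicks, views - clicks) distribution and compare them. The counts for the
second ad are illustrative, not from the talk.

# Compare two ads by simulating CTRs from their Beta distributions.
set.seed(1)
clicks_a <- 5;  views_a <- 30    # the Beta(5, 30 - 5) example from the previous slide
clicks_b <- 9;  views_b <- 70    # illustrative counts for a second ad

a <- rbeta(100000, clicks_a, views_a - clicks_a)
b <- rbeta(100000, clicks_b, views_b - clicks_b)

mean(a > b)                      # estimated probability that ad A has the higher CTR
quantile(a, c(0.025, 0.975))     # 95% interval for ad A's CTR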
Apache Storm
• Real-time calculations – the Hadoop of real time
• Fault tolerance, easy to scale
• Easy to develop - has local and distributed mode
• Storm multi-lang can be used with any language, including R
Storm Architecture
• Nimbus
• Master - the equivalent of the Hadoop JobTracker
• Distributes workload across the cluster
• Heartbeats, reallocation of workers when needed
• Supervisor
• Runs the workers
• Communicates with Nimbus via ZooKeeper
• ZooKeeper
• Coordination, node discovery
Source: Apache Storm
Storm Topology
• Graph of stream computations
• Basic primitive nodes:
• Spout – source of streams (Twitter API, queue, logs)
• Bolt – consumes streams, does the work, produces streams
• Storm Trident
Can integrate with third-party languages and databases:
• Java
• Python
• Ruby
• Redis
• HBase
• Cassandra
Image source: Storm github wiki
Storm + R
• Storm Multi-Language protocol
• Multiple Storm-R multi-language packages provide the Storm/R plumbing
• Recommended package: http://cran.r-project.org/web/packages/Storm
• Example R code
Storm and R
storm = Storm$new()
storm$lambda = function(s) {
  t = s$tuple
  t$output = vector(mode = "character", length = 1)
  clicks = as.numeric(t$input[1])
  views  = as.numeric(t$input[2])
  # Sample a CTR estimate from Beta(clicks, views - clicks)
  t$output[1] = rbeta(1, clicks, views - clicks)
  s$emit(t)
  # Alternative: mark the tuple as failed instead of emitting it
  # s$fail(t)
}
storm$run()
Storm and Java integration
• Define Spout/Bolt in any programming language
• Executed as a subprocess – JSON over stdin/stdout
public static class RBolt extends ShellBolt implements IRichBolt {
  public RBolt() {
    super("Rscript", "script.R");   // run the R script as a subprocess
  }
}
Source: Apache Storm
Storm + R = flexibility
• Integration with existing Storm ecosystem – NoSQL, Kafka
• SOA framework - DRPC
• Scaling up your existing R processes
• Trident
Source: Apache Storm
Storm References
• https://storm.apache.org
• Storm and Java stream algorithms implementations:
• https://github.com/addthis/stream-lib
• https://github.com/aggregateknowledge/java-hll
• https://github.com/pmerienne/trident-ml
Thank you
• Summary:
• Data stream algorithms
• Storm – can be used with stream algorithms
• Storm + R – for more traditional statistics
• Questions and discussion
• https://uk.linkedin.com/in/radekmaciaszek
• http://www.dataminelab.com
