With Akka Streams & Kafka
Intro & Agenda
• Crawler Intro & Problem Statements
• Crawler Architecture
• Infrastructure: Akka Streams, Kafka, etc.
• The Goodies
High-Level View
[Diagram: Crawl Jobs, Job DB, Validate, URL Cache, Download, Process; flows labeled URLs and Timestamps]
Requirements
• Ever-expanding number of URLs
• Can’t crawl all URLs at once
• Control over concurrent web GETs
• Efficient resource usage
• Resilient under high burst
• Scales horizontally & vertically
Sizing the Crawl Job
Let:
i = Number of seed URLs in a job
n = Average number of links per page
d = The crawl depth (how many layers to follow links)
u = The max number of URLs to process
Then:
$u = i \cdot n^{d}$
[Chart: totalURLs vs depth (initialURLs = 1, outLinks = 5), logarithmic totalURLs axis]
[Chart: totalURLs vs initialURLs (depth = 5, outLinks = 5), logarithmic axes]
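The sizing formula for the crawl job can be checked with a few lines of Scala. This is only a sketch: the object and method names (CrawlSizing, maxUrls) are made up for illustration, and the last call assumes an average of 5 out-links per page for the run reported on the Back-Pressure slide (seedURLs 100, outLinks 0 to 10, depth 5).

```scala
object CrawlSizing {
  /** Upper bound on URLs to process, u = i * n^d.
    * BigInt is used because the bound grows exponentially with depth. */
  def maxUrls(seedUrls: Int, linksPerPage: Int, depth: Int): BigInt = {
    require(seedUrls >= 0 && linksPerPage >= 0 && depth >= 0, "inputs must be non-negative")
    BigInt(seedUrls) * BigInt(linksPerPage).pow(depth)
  }

  def main(args: Array[String]): Unit = {
    // One seed URL, 5 out-links per page: the parameters of the depth chart.
    (0 to 10 by 2).foreach { d =>
      println(s"depth=$d -> ${maxUrls(1, 5, d)} URLs")
    }
    // 100 seeds, assumed average of 5 out-links, depth 5.
    println(maxUrls(100, 5, 5))
  }
}
```

With those assumed inputs the bound is 100 × 5^5 = 312500, which matches the totalCrawled figure on the Back-Pressure slide.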
The Reactive Manifesto
[Diagram: Responsive, Resilient, Elastic, Message Driven]
Why Does It Matter?
• Responsive: responds in a deterministic, timely manner
• Resilient: stays responsive in the face of failure, even cascading failures
• Elastic: stays responsive under workload spikes
• Message Driven: basic building block for responsive, resilient, and elastic systems
The Right Ingredients
• Kafka
  • Huge persistent buffer for the bursts
  • Load distribution to a very large number of processing nodes
  • Enables horizontal scalability
• Akka Streams
  • High-performance, highly efficient processing pipeline
  • Resilient with end-to-end back-pressure
  • Fully asynchronous: uses mapAsyncUnordered with the async HTTP client
• Async HTTP client
  • Non-blocking; consumes no threads while waiting
  • Integrates with Akka Streams for a high-parallelism, low-resource solution
[Diagram: Efficient, Resilient, Scale; Akka Stream, Async HTTP, Reactive Kafka]
Adding Kafka & Akka Streams
[Diagram: Crawl Jobs, Job DB, Validate, URL Cache, Download, Process, Akka Streams; flows labeled URLs and Timestamps]
Akka Streams, what???
• High performance, pure async, stream processing
• Conforms to Reactive Streams
• Simple yet powerful GraphDSL allows clear stream topology declaration
• Central point to understand the processing pipeline
Crawl Stream
Actual Stream Declaration in Code
prioritizeSource ~> crawlerFlow ~> bCast0 ~> result ~> bCast ~> outLinksFlow ~> outLinksSink
bCast ~> dataSinkFlow ~> kafkaDataSink
bCast ~> hdfsDataSink
bCast ~> graphFlow ~> merge ~> graphSink
bCast0 ~> maxPage ~> merge
bCast0 ~> retry ~> bCastRetry ~> retryFailed ~> merge
bCastRetry ~> errorSink
[Diagram of the stream declared above. Stages: Prioritized Source, Crawl, Result, MaxPageReached, Retry, OutLinks, Data, Graph, CheckFail, CheckErr. Sinks: OutLinks Sink, Kafka Data Sink, HDFS Data Sink, Graph Sink, Error Sink]
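For readers new to GraphDSL, the same `~>` wiring style can be tried on a much smaller graph. The sketch below is not the crawler's code. It assumes Akka 2.6+, where an implicit ActorSystem provides the materializer. FanOutSketch and the two toy branches are hypothetical stand-ins for the out-links and data branches.

```scala
import akka.actor.ActorSystem
import akka.stream.ClosedShape
import akka.stream.scaladsl.{Broadcast, Flow, GraphDSL, RunnableGraph, Sink, Source}
import scala.concurrent.Await
import scala.concurrent.duration._

object FanOutSketch {
  // Runs a two-branch graph and returns what each sink collected.
  def run(): (Seq[String], Seq[Int]) = {
    implicit val system: ActorSystem = ActorSystem("fan-out-sketch")

    // Toy sinks standing in for an out-links sink and a data sink.
    val outLinksSink = Sink.seq[String]
    val dataSink = Sink.seq[Int]

    val graph = RunnableGraph.fromGraph(
      GraphDSL.create(outLinksSink, dataSink)((_, _)) { implicit builder => (links, data) =>
        import GraphDSL.Implicits._
        // Broadcast sends every element to both branches.
        val bCast = builder.add(Broadcast[Int](2))
        Source(1 to 3) ~> bCast
        bCast ~> Flow[Int].map(page => s"link-from-page-$page") ~> links
        bCast ~> Flow[Int].map(_ * 10) ~> data
        ClosedShape
      })

    val (linksF, dataF) = graph.run()
    try (Await.result(linksF, 10.seconds), Await.result(dataF, 10.seconds))
    finally system.terminate()
  }
}
```

Each `~>` line reads as one edge of the topology, which is what makes the full crawl stream reviewable in one place.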
Resulting Characteristics
Efficient
• Low thread count, controlled by Akka and pure non-blocking async HTTP
• High-latency URLs do not block low-latency URLs, thanks to mapAsyncUnordered
• Well-controlled download concurrency using mapAsyncUnordered
• Thread per concurrent crawl job
Resilient
• Processes only what can be processed – no resource overload
• Kafka as short-term, persistent queue
Scale
• Kafka feeds next batch of URLs to available client
• Pull model – only processes that have capacity will get the load
• Kafka distributes work to large number of processing nodes in cluster
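The efficiency points rest on mapAsyncUnordered, which caps the number of in-flight futures and emits results as they complete. The following is a minimal sketch, assuming Akka 2.6+ (an implicit ActorSystem supplies the materializer). The fetch function is a hypothetical stand-in for a non-blocking HTTP GET and only simulates latency with the scheduler. BoundedDownloads and the example URLs are made up.

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._

object BoundedDownloads {
  // Hypothetical async "download": completes after a delay without blocking a thread.
  def fetch(url: String)(implicit system: ActorSystem): Future[String] = {
    import system.dispatcher
    val delay = if (url.endsWith("slow")) 500.millis else 10.millis
    akka.pattern.after(delay, system.scheduler)(Future.successful(s"body of $url"))
  }

  def crawl(urls: List[String], parallelism: Int): Seq[String] = {
    implicit val system: ActorSystem = ActorSystem("bounded-downloads")
    try {
      val done = Source(urls)
        // At most `parallelism` fetches are in flight; results are emitted in completion order.
        .mapAsyncUnordered(parallelism)(url => fetch(url))
        .runWith(Sink.seq)
      Await.result(done, 10.seconds)
    } finally system.terminate()
  }
}
```

Calling `BoundedDownloads.crawl(List("http://example.com/slow", "http://example.com/a", "http://example.com/b"), parallelism = 3)` should usually return the two fast bodies ahead of the slow one, since a slow response does not hold back faster ones. When all slots are busy, the stage stops pulling from upstream, which is how back-pressure reaches the source.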
Back-Pressure
[Chart: Queue Size (0 to 120,000) vs Time (0 to 700 seconds)]
[Chart: URLs/sec (0 to 400) vs Time (seconds)]
Parameters:
seedURLs: 100
parallelism: 1000
processTime: 1 – 5 s
outLinks: 0 – 10
depth: 5
totalCrawled: 312500
Challenges
Training
• Developers not used to end-to-end (E2E) stream definitions
• More familiar with deeply nested function calls
Maturity of Infrastructure
• Kafka 0.9 uses fetch as heartbeat
• Slow nodes cause timeout & rebalance
• Solved in 0.10.0.1
What it would have been…
• Bloated, ineffective concurrency control
• Lack of a well-thought-out and visible processing pipeline
• Clumsy code, hard to manage & understand
• Low training cost, high project TCO (Dev / Support / Maintenance)
Bottom Line
Standardized Reactive Platform
Efficiency & Resilience Meet Standardization
• Monitoring
  • Need to collect metrics consistently
• Logging
  • Correlation across services
  • Uniformity in logs
• Security
  • Need to apply standard security configuration
• Environment Resolution
  • Staging, production, etc.
Consistency in the face of heterogeneity
squbs is not…
• A framework on its own
• A programming model (use Akka)
• Take all or none: components/patterns can mostly be used independently
squbs: Akka for large-scale deployments
• Bootstrap
• Lifecycle management
• Loosely coupled module system
• Integration hooks for logging, monitoring, ops integration
squbs: Akka for large-scale deployments
• JSON console
• HttpClient with pluggable resolver and monitoring/logging hooks
• Test tools and interfaces
• Goodies:
  • Activators for Scala & Java
  • Programming patterns and helpers for Akka and Akka Stream use cases…, and growing
Performance
• Akka is principally designed for great performance & scalability
• Actor scheduling, dispatcher, message batching
  • throughput parameter
• The job for squbs is adding ops functionality without impacting performance
squbs Performance for Akka HTTP
• Pipeline (auth, logging, etc.) built as Akka Streams BidiFlow
• Pipeline supplied by infra or app
• App code supplies Flow or Route
• Pipeline and app flow baked into a single Request/Response Flow
• Fully utilizes Akka Streams fusing
Zero overhead for given functionality
Application Performance Tips
• Kafka: great firehose, with a bit of latency
• Only convert byte ↔ char when absolutely needed
• Work with ByteString if you can
• Need better facilities (ByteString JSON parser, etc.)
Remember: the app has the biggest impact on performance, not tuning
Akka Performance Tuning
• Parallelism factor: 1.0 is optimal
  • Minimizes context switches
• Use a separate dispatcher for blocking
• Test your throughput setting
• Try and test different GC settings
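These tuning knobs live in the Akka configuration. The fragment below is an illustrative sketch of where they go in application.conf, not the settings used in the talk. The numeric values and the dispatcher name blocking-io-dispatcher are assumptions chosen for the example.

```hocon
akka.actor.default-dispatcher {
  fork-join-executor {
    # Threads = available processors * parallelism-factor, clamped to min/max.
    parallelism-factor = 1.0
    parallelism-min = 2     # illustrative value
    parallelism-max = 64    # illustrative value
  }
  # Messages an actor processes before the thread moves to the next actor.
  # Measure before changing; 5 is only an example.
  throughput = 5
}

# Hypothetical dedicated dispatcher for code that must block.
blocking-io-dispatcher {
  type = Dispatcher
  executor = "thread-pool-executor"
  thread-pool-executor {
    fixed-pool-size = 16    # illustrative value
  }
  throughput = 1
}
```

Blocking actors can then be started with `Props(...).withDispatcher("blocking-io-dispatcher")`, which keeps blocked threads away from the default dispatcher.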
The Goodies
PerpetualStream
A non-stop stream; starts and stops with the system
• Provides a convenience trait to help write streams controlled by the system lifecycle
• Minimal/no message losses
• Register PerpetualStream to make the stream start/stop
• Provides customization hooks, especially for how to stop the stream
• Provides killSwitch (from Akka) to be embedded into the stream
• Implementers: just provide your stream!
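import akka.Done
import akka.stream.ClosedShape
import akka.stream.scaladsl.{GraphDSL, RunnableGraph, Sink, Source}
import org.squbs.stream.PerpetualStream
import scala.concurrent.Future

// Sink.ignore materializes a Future[Done], so that is the stream's materialized type.
class MyStream extends PerpetualStream[Future[Done]] {
  // Endless counter that wraps around at Int.MaxValue.
  def generator = Iterator.iterate(0) { p =>
    if (p == Int.MaxValue) 0 else p + 1
  }
  val source = Source.fromIterator(generator _)
  val ignoreSink = Sink.ignore

  override def streamGraph = RunnableGraph.fromGraph(
    GraphDSL.create(ignoreSink) { implicit builder => sink =>
      import GraphDSL.Implicits._
      // killSwitch comes from PerpetualStream so the system can stop the stream.
      source ~> killSwitch.flow[Int] ~> sink
      ClosedShape
    })
}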
PersistentBuffer/BroadcastBuffer
A buffer of virtually unlimited size
• Data & indexes in rotating memory-mapped files
• Off-heap rotating file buffer: very large buffers
• Restarts gracefully with no or minimal message loss
• Not as durable as a remote data store, but much faster
• Does not back-pressure upstream beyond data/index writes
• Similar usage to Buffer and Broadcast
• BroadcastBuffer: a FanOutShape that decouples each output port, making each downstream independent
• Useful if a downstream stage is blocked or unavailable
  • Kafka is unavailable/rebalancing but the system cannot back-pressure/deny incoming traffic
• Optional commit stage for at-least-once delivery semantics
• Implementation based on Chronicle Queue
Summary
• Kafka + Akka Streams + Async I/O = ideal architecture for high bursts & high efficiency
• Akka Streams
  • Clear view of stream topology
  • Back-pressure & Kafka allow buffering load bursts
• Standardization
  • Walk like a duck, quack like a duck, and manage it like a duck
• squbs: have the cake, and eat it too
  • Functionality without sacrificing performance
  • Goodies like PerpetualStream, PersistentBuffer, & BroadcastBuffer
Q&A – Feedback Appreciated
Lessons Learned From PayPal: Implementing Back-Pressure With Akka Streams And Kafka
