Fault Tolerance at Speed
Todd L. Montgomery
@toddlmontgomery
StoneTor
Watch the video with slide synchronization on InfoQ.com!
https://www.infoq.com/presentations/aeron-cluster-raft/
Presented at QCon San Francisco
www.qconsf.com
About me…
What type of Fault Tolerance?
What is Clustering?
Why Aeron?
Design for Speeding Up?
Efficiency
https://www.forbes.com/sites/forbestechcouncil/2017/12/15/why-energy-is-a-big-and-rapidly-growing-problem-for-data-centers/#344456665a30
https://www.datacenterdynamics.com/opinions/power-consumption-data-centers-global-problem/
https://www.nature.com/articles/d41586-018-06610-y
We seem to assume efficiency/security/quality/etc. is a “special” characteristic added … later… if at all
Fault Tolerance
[Diagram sequence: one Client and one Service; several Clients and several Services; several Services with State; State “Storage”]
Fault Tolerance of State
[Diagram: three Services and State]
Partition Replication
Contiguous Log with Snapshot & Replay
[Diagram sequence: a log of events 1, 2, 3, 4, 5, 6, X, …; State built from the log; a Snapshot taken of the State; State rebuilt from the Snapshot plus events 5, 6, X, …]
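The snapshot and replay idea above can be sketched in a few lines. This is a minimal illustration and not Aeron code: `Counter`, `Snapshot`, `take_snapshot`, and `recover` are hypothetical names, and the state is assumed to be a simple running total so that the result is easy to check by hand.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Counter:
    """Hypothetical service state: a running total built only from log events."""
    total: int = 0

    def apply(self, event: int) -> None:
        self.total += event


@dataclass
class Snapshot:
    state_total: int  # copy of the state when the snapshot was taken
    next_index: int   # index of the first log event NOT covered by the snapshot


def take_snapshot(state: Counter, next_index: int) -> Snapshot:
    """Roll up all events before next_index into a single stored state."""
    return Snapshot(state.total, next_index)


def recover(log: List[int], snapshot: Optional[Snapshot] = None) -> Counter:
    """Rebuild state from a snapshot (if any) plus the log events after it."""
    state = Counter(snapshot.state_total) if snapshot else Counter()
    start = snapshot.next_index if snapshot else 0
    for event in log[start:]:
        state.apply(event)
    return state


log = [1, 2, 3, 4, 5, 6]

# Process the first four events, then snapshot.
live = Counter()
for e in log[:4]:
    live.apply(e)
snap = take_snapshot(live, next_index=4)
```

Calling `recover(log)` replays the whole log, while `recover(log, snap)` starts from the snapshot and replays only the tail. Both should arrive at the same state, which is the property that lets older log segments be rolled up.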
Clustered Services
[Diagram: three Services, each with a Log and an Archive]
Replicated State Machines
https://en.wikipedia.org/wiki/State_machine_replication
Each Replicated Service
Same event log
Same input ordering
Log replicated locally
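As a rough sketch of why the same event log in the same order matters, the following assumes a hypothetical deterministic `KeyValueService` whose state changes only through `apply`. The event tuples and names are made up for illustration and are not part of Aeron.

```python
class KeyValueService:
    """Hypothetical deterministic service: state changes only via apply()."""

    def __init__(self):
        self.store = {}

    def apply(self, event):
        op, key, value = event
        if op == "put":
            self.store[key] = value
        elif op == "delete":
            self.store.pop(key, None)


def replay(log):
    """Build a fresh replica by applying every event in log order."""
    service = KeyValueService()
    for event in log:
        service.apply(event)
    return service.store


log = [("put", "a", 1), ("put", "b", 2), ("delete", "a", None), ("put", "b", 3)]

# Three replicas, each consuming its own local copy of the same log.
replicas = [replay(list(log)) for _ in range(3)]

# The same events in a different order can produce a different state.
reordered = replay(list(reversed(log)))
```

Every entry in `replicas` ends up identical because each replica saw the same inputs in the same order, whereas `reordered` diverges.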
Replicated State Machines
Checkpoints / Snapshots
Event in the log
“Rolling” up previous log events
Replicated State Machines
When should a service “consume” (or process) a log event?
[Diagram: three Services, each with an Archive; the archived logs hold events 1 to 6, 1 to 7, and 1 to 2]
Once processed, an event cannot be altered
Only process an event once it is stable
Raft Consensus
An event must be recorded at a majority of Replicas before being consumed by any Replica
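One way to picture the majority rule is as a calculation over the positions each member has recorded. The sketch below is an assumption-laden illustration, not the Aeron Cluster implementation: `quorum_position` and `can_consume` are hypothetical helpers, and the byte positions are invented.

```python
def quorum_position(recorded_positions):
    """Highest log position that a majority of members have recorded."""
    ranked = sorted(recorded_positions, reverse=True)
    majority = len(ranked) // 2 + 1
    # The majority-th highest position is held by at least `majority` members.
    return ranked[majority - 1]


def can_consume(event_end_position, commit_position):
    """An event is safe to process once it lies at or below the commit position."""
    return event_end_position <= commit_position


# Hypothetical three-member cluster: recorded log position per member.
positions = [8192, 6912, 4096]
commit = quorum_position(positions)
```

With three members, two of them have recorded at least up to 6912, so events ending at or before that position are stable, and anything beyond it has to wait.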
Replicated State Machines
https://raft.github.io/
Strong Leader
Elected member of the Cluster
Orders Input
Disseminates Consensus
Raft
[Diagram: three members, each with Service, Archive, and Consensus]
Raft is
An algorithm with formal verification
Replicated State Machines
Raft is not
A specification
Nor
A complete system
Replicated State Machines
More than Raft
Leader timestamps events
Async, not RPC-based
Timers
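A sketch of how leader-assigned timestamps can make timers deterministic: the service below never reads a local clock and only advances time when a timestamped log event arrives. `TimerService`, `run`, and the log layout are hypothetical and chosen only for illustration.

```python
import heapq


class TimerService:
    """Hypothetical timers driven by log timestamps rather than the local clock."""

    def __init__(self):
        self._deadlines = []  # min-heap of (deadline_ms, timer_id)
        self.fired = []       # (timer_id, timestamp of the event that expired it)

    def schedule(self, timer_id, deadline_ms):
        heapq.heappush(self._deadlines, (deadline_ms, timer_id))

    def on_log_event(self, timestamp_ms):
        # Expire every timer whose deadline is at or before the event timestamp.
        while self._deadlines and self._deadlines[0][0] <= timestamp_ms:
            _, timer_id = heapq.heappop(self._deadlines)
            self.fired.append((timer_id, timestamp_ms))


def run(log):
    """Replay a log of (timestamp_ms, command); command is (timer_id, delay_ms) or None."""
    service = TimerService()
    for timestamp_ms, command in log:
        service.on_log_event(timestamp_ms)
        if command is not None:
            timer_id, delay_ms = command
            service.schedule(timer_id, timestamp_ms + delay_ms)
    return service.fired


log = [(1000, ("t1", 50)), (1020, ("t2", 10)), (1040, None), (1060, None)]
```

Because expiry depends only on timestamps carried in the log, replaying the same log on any replica, at any later time, fires the same timers at the same points in the event sequence.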
The Real World
[Diagram: three members, each with Service, Archive, and Consensus; one member is marked *Leader; a Client is shown]
Benefits
Determinism
Log is immutable
Log can be played, stopped, & replayed
Each event is timestamped
Services restarted from snapshot & log
What Can You Do?
Distributed Key/Value Store
Distributed Timers
Distributed Locks
Matching Engines
Order Management
Market Surveillance
P&L, Risk, …
Finance
Venue Ticketing / Reservations
Auctions
Beyond
Hint - a contended database is a good indicator
Why Aeron?
Efficient reliable UDP unicast, UDP multicast, and IPC message transport
Java, C/C++, C#, Go
Aeron
https://github.com/real-logic/Aeron
And a little bit more…
Very fast Archival & Replay
The “Efficient” bit…
All communications
Aeron publications & subscriptions
Aeron archival & replay
Aeron shared counters
Consensus
based on Aeron stream position
Batching
Critical to efficient operation
Optimizing pipelined throughput
Flow Control
Critical to correct operation
Design for Efficiency?
Cache Hit/Miss Ratios
Branch Prediction
Allocation Rates
Garbage Collection
Inlining
Optimizations
Not… Yet…
Ownership, Dependency, & Coupling
Complexity
Layers of Abstraction (ain’t free)
Resource Management
Closer… But…
Still. Not. Yet.
"AmdahlsLaw" by Daniels220 at English Wikipedia - Own work based on: File:AmdahlsLaw.png. Licensed under CC BY-SA 3.0 via Wikimedia Commons
Universal Scalability Law
[Chart: Speedup (0 to 20) versus Processors (1 to 1024), comparing Amdahl and USL]
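For reference, the two curves named in the chart can be computed from their standard formulas. The parameter values below (95% parallel fraction, and the contention and coherence coefficients) are assumptions picked for illustration and are not read from the slide.

```python
def amdahl_speedup(n, parallel_fraction):
    """Amdahl's Law: speedup on n processors when only part of the work parallelizes."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)


def usl_speedup(n, contention, coherence):
    """Universal Scalability Law: adds a coherence (crosstalk) penalty to contention."""
    return n / (1.0 + contention * (n - 1) + coherence * n * (n - 1))


PROCESSORS = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]

# Assumed parameters: 5% serial work, plus a small coherence cost for USL.
amdahl = [amdahl_speedup(n, 0.95) for n in PROCESSORS]
usl = [usl_speedup(n, 0.05, 0.0001) for n in PROCESSORS]

for n, a, u in zip(PROCESSORS, amdahl, usl):
    print(f"{n:5d}  Amdahl {a:6.2f}  USL {u:6.2f}")
```

With a coherence coefficient of zero the USL reduces to Amdahl's Law, which levels off. With any positive coherence cost, the USL curve peaks and then declines as processors are added.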
Breakdown Interactions
Fundamental Sequential Operations
Ingress Message, Sequence, Disseminate
[Diagram: Client, Leader, Follower X, Follower Y; channels: Ingress, Log (multicast or serial unicast), Member Status; messages shown: two Log Events]
Followers Append
[Diagram: Client, Leader, Follower X, Follower Y; channels: Ingress, Log (multicast or serial unicast), Member Status; messages shown: two Append Positions]
Commit Message
[Diagram: Client, Leader, Follower X, Follower Y; channels: Ingress, Log (multicast or serial unicast), Member Status; messages shown: two Commit Positions]
Breakdown Interactions
Pipeline-able Operation & Batching
[Diagram: Leader and Follower; channels: Log (multicast or serial unicast), Member Status]
Stream Positions
Commit Position @4096
Append Position @6912
Log Event @8192
Archive Position @8096, Archive Position @7168
Store locally asynchronous to Position processing by Consensus, & Log processing by Service
Batching: Log, Appends, Commits
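One reason stream positions lend themselves to batching is that a single position report covers every event before it. The sketch below illustrates that idea with hypothetical `Log` and `Follower` classes. It is not Aeron's API, and the 32-byte payloads are made up.

```python
class Log:
    """Hypothetical append-only log that tracks a byte position."""

    def __init__(self):
        self.events = []   # list of (end_position, payload)
        self.position = 0

    def append(self, payload):
        self.position += len(payload)
        self.events.append((self.position, payload))
        return self.position


class Follower:
    """Hypothetical follower that appends whatever is available, then reports once."""

    def __init__(self):
        self.next_event = 0
        self.append_position = 0
        self.status_messages = []  # append positions reported to the leader

    def poll(self, log):
        batch = log.events[self.next_event:]
        if not batch:
            return 0
        self.next_event += len(batch)
        # One position report acknowledges every event in the batch.
        self.append_position = batch[-1][0]
        self.status_messages.append(self.append_position)
        return len(batch)


log = Log()
follower = Follower()

for _ in range(3):
    log.append(b"x" * 32)
first_batch = follower.poll(log)

for _ in range(2):
    log.append(b"x" * 32)
second_batch = follower.poll(log)
```

Five events are acknowledged here with two status messages. The busier the log, the more events each position update tends to cover, so the per-event cost of consensus traffic falls as load rises.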
Doesn’t this Complicate Recovery?
Recovery Positions
[Diagram: three Followers]
Archive Position @8096, Archive Position @7168, Archive Position @7584
Commit Position @4096, Commit Position @4064, Commit Position @4032
Service Position @4096, Service Position @4064, Service Position @3776
A synchronous system doesn’t make this complexity go away!
Election still needs to assert the state of the cluster & catch up locally
Limitations of Efficiency
Throughput & Latency
[Diagram: Client, Leader, and Followers, with services labelled Service A and Service Ox; channels: Ingress, Log (multicast or serial unicast), Member Status; messages: Log Event, Append Position, Commit Position]
Constant Delay Network, in units of Round-Trip Time (RTT)
Client to Service A: 0.5 RTT
Client to Service Ox: 1 RTT
Client to Service A (on Commit): 1.5 RTT
Client to Service Ox (on Commit): 2 RTT
Limits from Constant Delay
Shared Memory RTT <100ns
Client to Service A: 50ns
Client to Service Ox: 100ns
Client to Service A (on Commit): 150ns
Client to Service Ox (on Commit): 200ns
DC RTT <100us
Client to Service A: 50us
Client to Service Ox: 100us
Client to Service A (on Commit): 150us
Client to Service Ox (on Commit): 200us
Rack (Kernel Bypass) RTT <10us
Client to Service A: 5us
Client to Service Ox: 10us
Client to Service A (on Commit): 15us
Client to Service Ox (on Commit): 20us
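The figures above follow from multiplying each path's share of a round trip by the RTT of the medium. A small sketch of that arithmetic, where `RTT_FACTORS` and `latency_limits` are hypothetical names and the unit is whatever unit the RTT is given in:

```python
# Fraction of a round trip needed for each path, as listed on the slides.
RTT_FACTORS = {
    "Client to Service A": 0.5,
    "Client to Service Ox": 1.0,
    "Client to Service A (on Commit)": 1.5,
    "Client to Service Ox (on Commit)": 2.0,
}


def latency_limits(rtt):
    """Scale each path's RTT fraction by the medium's round-trip time."""
    return {path: factor * rtt for path, factor in RTT_FACTORS.items()}


shared_memory_ns = latency_limits(100)  # RTT bound of 100ns
data_center_us = latency_limits(100)    # RTT bound of 100us
rack_us = latency_limits(10)            # RTT bound of 10us

for path, value in rack_us.items():
    print(f"{path}: {value:g}us")
```

Since each RTT is stated as an upper bound, the results are best read as limits imposed by the network alone, before any processing time is added.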
Measured Latency at Throughput
[Chart: RTT (us), 0 to 300, by Percentile (Min, 0.50, 0.90, 0.99, 0.9999, 0.999999, Max), for two series: 100K msgs/sec and 200K msgs/sec]
Single client session, bursts of 20x 200B messages, 3-node cluster, Service(s) echo(es) the payload back.
Intel Xeon Gold 5118 (2.30GHz, 12 cores)
32GB DDR4 2400 MHz ECC RAM
Intel Optane SSD 900P Series 480GB
SolarFlare X2522-PLUS 10GbE NIC
All servers are connected to an Arista 7150S
CentOS Linux 7.7, kernel 4.4.195-1.el7.elrepo.x86_64 tuned for low-latency workload.
Courtesy Mark Price
Takeaways
Efficiency is part of design
Power of a timestamped, replicated log
Replicated State Machines
Current Status
Aeron Archiving - fully supported
Aeron Clustering - pre-release
Sponsored by
https://weareadaptive.com/
Aeron: https://github.com/real-logic/Aeron
Twitter: @toddlmontgomery
Thank You!
Questions?
StoneTor