Building & Operating
Fighting fires & keeping systems up
Why Are Streams Hard?
In streaming architectures, implementation gaps in non-functional requirements can be unforgiving.
You end up spending a lot of your time fighting fires & keeping systems up.
If you don't build your systems with the -ilities as first class citizens, you pay an operational tax.
… and this translates to unhappy customers and burnt-out team members!
Why Are Streams Hard?
Data Infrastructure is an iceberg.
Your customers may only see 10% of your effort — those that manifest in features.
The remaining 90% of your work goes unnoticed because it relates to keeping the lights on.
In this talk, we will build high-fidelity streams-as-a-service from the ground up!
Start Simple
Goal : Build a system that can deliver messages from source S to destination D.
[Diagram: S → D]
But first, let's decouple S and D by putting messaging infrastructure between them.
[Diagram: S → E (Events topic) → D]
Make a few more implementation decisions about this system:
• Run our system on a cloud platform (e.g. AWS)
• Operate at low scale
• Kafka with a single partition
• Kafka across 3 brokers split across AZs with RF=3 (min in-sync replicas = 2)
• Run S & D on single, separate EC2 instances
To make things a bit more interesting, let's provide our stream as a service.
We define our system boundary using a blue box as shown below!
[Diagram: S → (our stream service, containing E) → D]
Reliability
(Is This System Reliable?)
Goal : Build a system that can deliver messages reliably from S to D.
Concrete Goal : 0 message loss.
Once S has ACKd a message to a remote sender, D must deliver that message to a remote receiver.
Reliability
How do we build reliability into our system?
Let's first generalize our system!
[Diagram: A → B → C, with message m1 flowing from A to C]
In order to make this system reliable:
Treat the messaging system like a chain — it's only as strong as its weakest link.
Insight : If each process/link is transactional in nature, the chain will be transactional!
Transactionality = At least once delivery.
How do we make each link transactional?
Reliability
Let's first break this chain into its component processing links.
[Diagram: A → Kafka → B → Kafka → C, each link passing m1]
A is an ingest node.
B is an internal node.
C is an expel node.
Reliability
But, how do we handle edge nodes A & C?
What does A need to do?
• Receive a Request (e.g. REST)
• Do some processing
• Reliably send data to Kafka
  • kProducer.send(topic, message)
  • kProducer.flush()
  • Producer Config
    • acks = all
• Send HTTP Response to caller
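As a concrete illustration, here is a minimal sketch of A's ingest path in Java (the class name, the "events" topic, and the bootstrap-servers parameter are assumptions, not part of the original design): the producer is configured with acks = all, and the HTTP response is only sent after send() + flush() succeed.

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerConfig;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.util.Properties;

    public class IngestNode {

        private final KafkaProducer<String, String> producer;

        public IngestNode(String bootstrapServers) {
            Properties props = new Properties();
            props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
            props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG,
                      "org.apache.kafka.common.serialization.StringSerializer");
            props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
                      "org.apache.kafka.common.serialization.StringSerializer");
            // Reliability: wait for all in-sync replicas to acknowledge each write.
            props.put(ProducerConfig.ACKS_CONFIG, "all");
            this.producer = new KafkaProducer<>(props);
        }

        /** Handles one inbound API request; returns true only after Kafka has ACKd. */
        public boolean handleRequest(String message) {
            try {
                producer.send(new ProducerRecord<>("events", message));
                producer.flush();  // block until the message is durably written
                return true;       // caller now sends a success response to the sender
            } catch (Exception e) {
                return false;      // caller sends an error so the sender can retry
            }
        }
    }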
Reliability
What does C need to do?
• Read data (a batch) from Kafka
• Do some processing
• Reliably send data out
• ACK / NACK Kafka
  • Consumer Config
    • enable.auto.commit = false
  • ACK moves the read checkpoint forward
  • NACK forces a reread of the same data
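A minimal sketch of C's consume-process-expel loop under these settings (the "events" topic, the group id, and the sendDownstream() placeholder are assumptions): a successful batch is ACKd with commitSync(), and a failed batch is NACKd by seeking back to the start of the batch so the same data is reread.

    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.common.TopicPartition;

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;

    public class ExpelNode {

        public static void main(String[] args) {
            Properties props = new Properties();
            props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
            props.put(ConsumerConfig.GROUP_ID_CONFIG, "expel-node");
            props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                      "org.apache.kafka.common.serialization.StringDeserializer");
            props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                      "org.apache.kafka.common.serialization.StringDeserializer");
            // Reliability: commit offsets manually, only after a successful send.
            props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("events"));
                while (true) {
                    ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
                    if (batch.isEmpty()) continue;
                    if (sendDownstream(batch)) {
                        consumer.commitSync();   // ACK: move the read checkpoint forward
                    } else {
                        // NACK: rewind to the start of the batch so the same data is reread
                        for (TopicPartition tp : batch.partitions()) {
                            consumer.seek(tp, batch.records(tp).get(0).offset());
                        }
                    }
                }
            }
        }

        private static boolean sendDownstream(ConsumerRecords<String, String> batch) {
            // Placeholder for "do some processing + reliably send data out".
            return true;
        }
    }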
Reliability
And what about the internal node B?
B is a combination of A and C.
B needs to act like a reliable Kafka Producer.
B needs to act like a reliable Kafka Consumer.
Reliability
How reliable is our system now? What happens if a process crashes?
If A crashes, we will have a complete outage at ingestion!
If C crashes, we will stop delivering messages to external consumers!
Reliability
Solution : Place each service in an autoscaling group of size T, so we can tolerate T-1 concurrent failures.
For now, we appear to have a pretty reliable data stream.
But how do we measure its reliability?
(This brings us to …)
Observability
(A story about Lag & Loss Metrics)
Lag : What is it?
Lag is simply a measure of message delay in a system.
The longer a message takes to transit a system, the greater its lag.
The greater the lag, the greater the impact to the business.
Hence, our goal is to minimize lag in order to deliver insights as quickly as possible.
Lag : How do we compute it?
eventTime : the creation time of an event message.
Lag can be calculated for any message m1 at any node N in the system as
lag(m1, N) = current_time(m1, N) - eventTime(m1)
Arrival Lag (Lag-in) : time the message arrives at a node - eventTime.
For a message with eventTime T0 that arrives at A, B, and C at T1, T3, and T5:
• Lag-in @ A = T1 - T0 (e.g. 1 ms)
• Lag-in @ B = T3 - T0 (e.g. 5 ms)
• Lag-in @ C = T5 - T0 (e.g. 10 ms)
Note that Lag-in is cumulative: each node's Lag-in includes the time spent at all upstream hops.
Departure Lag (Lag-out) : time the message leaves a node - eventTime.
If the message leaves A, B, and C at T2, T4, and T6:
• Lag-out @ A = T2 - T0 (e.g. 3 ms)
• Lag-out @ B = T4 - T0 (e.g. 8 ms)
• Lag-out @ C = T6 - T0 (e.g. 12 ms)
E2E Lag is the total time a message spent in the system: E2E Lag = Lag-out @ C = T6 - T0.
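A minimal, illustrative sketch of how a node might record these two measurements, assuming each message carries its eventTime as epoch milliseconds (the class and the emit() sink are hypothetical):

    /** Minimal lag bookkeeping for one node N; eventTime is epoch millis carried on the message. */
    public class LagRecorder {

        private final String node;

        public LagRecorder(String node) {
            this.node = node;
        }

        /** lag(m, N) = current_time - eventTime(m), measured when the message arrives. */
        public long recordLagIn(long eventTimeMillis) {
            long lagIn = System.currentTimeMillis() - eventTimeMillis;
            emit("lag_in", lagIn);
            return lagIn;
        }

        /** Same computation at departure; at the last node this is the E2E Lag. */
        public long recordLagOut(long eventTimeMillis) {
            long lagOut = System.currentTimeMillis() - eventTimeMillis;
            emit("lag_out", lagOut);
            return lagOut;
        }

        private void emit(String metric, long millis) {
            // Placeholder: in a real pipeline this would feed a histogram (e.g. for P95).
            System.out.printf("%s{node=%s} = %d ms%n", metric, node, millis);
        }
    }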
Lag : How do we compute it?
While it is interesting to know the lag for a particular message m1, it is of little use since we typically deal with millions of messages.
Instead, we prefer statistics (e.g. P95) to capture population behavior.
Some useful Lag statistics are:
• E2E Lag (p95) : 95th percentile time of messages spent in the system
• Lag_[in|out](N, p95) : P95 Lag_in or Lag_out at any Node N
• Process_Duration(N, p95) : Lag_out(N, p95) - Lag_in(N, p95)
Process_Duration graphs show you the contribution to overall Lag from each hop.
Loss : What is it?
Loss is simply a measure of messages lost while transiting the system.
Messages can be lost for various reasons, most of which we can mitigate!
The greater the loss, the lower the data quality.
Hence, our goal is to minimize loss in order to deliver high quality insights.
Loss : How do we compute it?
Loss can be computed as the set difference of messages between any 2 points in the system.
For example, track whether each message was seen (1) or not (0) at 4 checkpoints in the pipe:

Message Id       | N1 | N2 | N3 | N4 | E2E Loss | E2E Loss %
m1               | 1  | 1  | 1  | 1  |          |
m2               | 1  | 1  | 1  | 1  |          |
m3               | 1  | 0  | 0  | 0  |          |
…                | …  | …  | …  | …  |          |
m10              | 1  | 1  | 0  | 0  |          |
Count            | 10 | 9  | 7  | 5  |          |
Per Node Loss(N) | 0  | 1  | 2  | 2  | 5        | 50%
Loss : How do we compute it?
In a streaming data system, messages never stop flowing. So, how do we know when to count?
Solution:
• Allocate messages to 1-minute wide time buckets using the message eventTime
• Wait a few minutes for messages to transit, then compute the loss table above for each closed bucket (e.g. the @12:34p bucket)
• Raise alarms if loss occurs over a configured threshold (e.g. > 1%)
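A minimal sketch of this bucketed accounting (class and method names are hypothetical): each checkpoint counts messages per 1-minute eventTime bucket, and loss is the difference between an upstream and a downstream checkpoint for a closed bucket.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.LongAdder;

    /** Counts messages per 1-minute eventTime bucket at one checkpoint (e.g. ingest or expel). */
    public class BucketCounter {

        private final Map<Long, LongAdder> counts = new ConcurrentHashMap<>();

        public void record(long eventTimeMillis) {
            long bucket = eventTimeMillis / TimeUnit.MINUTES.toMillis(1);
            counts.computeIfAbsent(bucket, b -> new LongAdder()).increment();
        }

        public long count(long bucket) {
            LongAdder adder = counts.get(bucket);
            return adder == null ? 0 : adder.sum();
        }

        /** Loss % for a bucket, comparing an upstream checkpoint with a downstream one. */
        public static double lossPercent(BucketCounter in, BucketCounter out, long bucket) {
            long sent = in.count(bucket);
            if (sent == 0) return 0.0;
            return 100.0 * (sent - out.count(bucket)) / sent;
        }
    }

Once a bucket has been closed for a few minutes, lossPercent(ingestCounts, expelCounts, bucket) > 1.0 would raise the alarm described above.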
We now have a way to measure the reliability (via Loss metrics) and latency (via Lag metrics) of our system.
But wait…
Performance
(have we tuned our system for performance yet??)
Performance
Goal : Build a system that can deliver messages reliably from S to D with low latency.
[Diagram: many senders S → our stream service → destination D]
To understand streaming system performance, let's understand the components of E2E Lag.
Performance
Ingest Time : Time from Last_Byte_In_of_Request to First_Byte_Out_of_Response.
• This time includes the overhead of reliably sending messages to Kafka.
Performance
Expel Time : Time to process and egest a message at D.
Performance
E2E Lag : Total time messages spend in the system from message ingest to expel!
Transit Time : The rest of the time spent in the data pipe (i.e. at the internal nodes).
Putting it together: E2E Lag = Ingest Time + Transit Time + Expel Time.
Performance Penalties
(Trading off Latency for Reliability)
In order to have stream reliability, we must sacrifice latency!
How can we handle our performance penalties?
Performance
Challenge 1 : Ingest Penalty
• In the name of reliability, S needs to call kProducer.flush() on every inbound API request
• S also needs to wait for 3 ACKs from Kafka before sending its API response
Approach : Amortization
• Support Batch APIs (i.e. multiple messages per web request) to amortize the ingest penalty
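A minimal sketch of such a batch endpoint handler (the class name and the "events" topic are assumptions), reusing an acks = all producer; the flush() and the wait for ACKs are now paid once per request instead of once per message.

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    import java.util.List;

    /** Batch ingest: many messages per web request, but only one flush per request. */
    public class BatchIngestHandler {

        private final KafkaProducer<String, String> producer;  // configured with acks=all

        public BatchIngestHandler(KafkaProducer<String, String> producer) {
            this.producer = producer;
        }

        public boolean handleBatch(List<String> messages) {
            try {
                for (String message : messages) {
                    producer.send(new ProducerRecord<>("events", message));
                }
                // The flush (and the wait for acks=all) is amortized over the whole batch.
                producer.flush();
                return true;   // success response for the whole batch
            } catch (Exception e) {
                return false;  // error response: the sender retries the batch
            }
        }
    }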
Performance
Challenge 2 : Expel Penalty
Observations
• Kafka is very fast — many orders of magnitude faster than HTTP RTTs
• The majority of the expel time is the HTTP RTT
Approach : Amortization
• In each D node, add batch + parallelism
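One illustrative way to add batch + parallelism on the expel path (all names here are hypothetical; sendOne() stands in for the real HTTP call to the destination): the batch is fanned out across a thread pool, and Kafka is only ACKd if every message was delivered.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    /** Expel a Kafka batch by fanning its messages out over parallel HTTP senders. */
    public class ParallelExpel {

        private final ExecutorService pool = Executors.newFixedThreadPool(8);

        /** Returns true only if every message was delivered; the caller then ACKs Kafka. */
        public boolean expel(List<String> batch) throws Exception {
            List<Future<Boolean>> inFlight = new ArrayList<>();
            for (String message : batch) {
                Callable<Boolean> send = () -> sendOne(message);
                inFlight.add(pool.submit(send));
            }
            boolean allOk = true;
            for (Future<Boolean> result : inFlight) {
                allOk &= result.get();   // each get() waits for one in-flight HTTP RTT
            }
            return allOk;                // false => NACK / retry path
        }

        private boolean sendOne(String message) {
            // Placeholder for the real HTTP POST to the destination D.
            return true;
        }
    }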
Performance
Challenge 3 : Retry Penalty (@ D)
Concepts
• In order to run a zero-loss pipeline, we need to retry messages @ D that will succeed given enough attempts
  • We call these Recoverable Failures
• In contrast, we should never retry a message that has 0 chance of success!
  • We call these Non-Recoverable Failures
  • E.g. any 4xx HTTP response code, except for 429 (Too Many Requests)
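A small sketch of that classification rule (the exact policy is destination-specific, and timeouts or connection errors would typically also be treated as recoverable):

    /** Rough classification of a failed HTTP response from the destination, per the rule above. */
    public final class RetryPolicy {

        private RetryPolicy() {}

        public static boolean isRecoverable(int httpStatus) {
            if (httpStatus == 429) return true;                       // Too Many Requests: back off and retry
            if (httpStatus >= 400 && httpStatus < 500) return false;  // other 4xx: never retry
            return httpStatus >= 500;                                 // 5xx: transient server-side failure
        }
    }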
Performance
Challenge 3 : Retry Penalty
Approach
• We pay a latency penalty on retry, so we need to be smart about
  • What we retry — Don't retry any non-recoverable failures
  • How we retry — One idea : Tiered Retries
Performance - Tiered Retries
Local Retries
• Try to send the message a configurable number of times @ D
• If we exhaust local retries, D transfers the message to a Global Retrier
Global Retries
• The Global Retrier then retries the message over a longer span of time
[Diagram: D exchanges failed messages with a global retrier (gr) via Retry_In (RI) and Retry_Out (RO) topics]
RI : Retry_In, RO : Retry_Out
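A minimal sketch of the local tier at D, handing off to the global tier via the Retry_In topic (the "retry_in" topic name and class names are assumptions; RetryPolicy is the classification sketch from earlier):

    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    /** Tiered retries at D: a few quick local attempts, then hand off to the global retrier. */
    public class TieredRetrySender {

        private final KafkaProducer<String, String> producer;  // acks=all, shared with the pipeline
        private final int maxLocalAttempts;

        public TieredRetrySender(KafkaProducer<String, String> producer, int maxLocalAttempts) {
            this.producer = producer;
            this.maxLocalAttempts = maxLocalAttempts;
        }

        public void deliver(String message) {
            for (int attempt = 1; attempt <= maxLocalAttempts; attempt++) {
                int status = sendOnce(message);
                if (status >= 200 && status < 300) return;       // delivered
                if (!RetryPolicy.isRecoverable(status)) return;  // non-recoverable: don't retry, surface/alert instead
            }
            // Local retries exhausted: hand off to the global retrier via the Retry_In topic,
            // so this worker can move on to the next message.
            producer.send(new ProducerRecord<>("retry_in", message));
            producer.flush();
        }

        private int sendOnce(String message) {
            // Placeholder for the real HTTP POST to the destination; returns the response code.
            return 200;
        }
    }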
Performance
At this point, we have a system that works well at low scale.
Scalability
First, let's dispel a myth!
Each system is traffic-rated.
The traffic rating comes from running load tests.
There is no such thing as a system that can handle infinite scale.
We only achieve higher scale by iteratively running load tests & removing bottlenecks.
Scalability - Autoscaling
Autoscaling Goals (for data streams):
• Goal 1: Automatically scale out to maintain low latency (e.g. E2E Lag)
• Goal 2: Automatically scale in to minimize cost
Autoscaling Considerations
• What can autoscale? What can't autoscale?
Scalability - Autoscaling EC2
The most important part of autoscaling is picking the right metric to trigger autoscaling actions.
Pick a metric that
• Preserves low latency
• Goes up as traffic increases
• Goes down as the microservice scales out
E.g. Average CPU (see the sketch after this list)
What to be wary of
• Any locks/code synchronization & IO waits
• Otherwise … as traffic increases, CPU will plateau, auto-scale-out will stop, and latency (i.e. E2E Lag) will increase
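Purely as an illustration of why average CPU behaves well as a trigger (in practice you would lean on the cloud platform's target-tracking autoscaling rather than hand-rolled code; everything here is hypothetical):

    import java.util.List;

    /** Illustrative target-tracking check: scale out/in on average CPU across the group. */
    public class CpuScalingCheck {

        private final double targetCpuPercent;   // e.g. 50.0

        public CpuScalingCheck(double targetCpuPercent) {
            this.targetCpuPercent = targetCpuPercent;
        }

        /** Returns the desired instance count given per-instance CPU samples. */
        public int desiredCapacity(List<Double> perInstanceCpu) {
            int current = perInstanceCpu.size();
            double avg = perInstanceCpu.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
            // Average CPU rises with traffic and falls as instances are added, which is what
            // makes it a workable trigger; scale proportionally toward the target.
            int desired = (int) Math.ceil(current * avg / targetCpuPercent);
            return Math.max(1, desired);
        }
    }

Because average CPU rises with traffic and falls as instances are added, this check converges; a metric dominated by lock contention or IO waits would not.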
What Next?
We now have a system with the Non-functional Requirements (NFRs) that we desire!
What Next?
What if we want to handle
• Different types of messages
• More complex processing (i.e. more processing stages)
• More complex stream topologies (e.g. 1-1, 1-many, many-many)
What Next?
It will take a lot of work to rebuild our data pipe for each variation of customers' needs!
What we need to do is build a more generic Streams-as-a-Service (STaaS) platform!
Building StaaS
Firstly, let's make our pipeline a bit more realistic by adding more processing stages:
[Pipeline: Ingest → Normalize → Enrich → Route → Transform → Transmit]
Building StaaS
And by handling more complex topologies (e.g. many-to-many):
[Topology: Ingest (n1, n2, n3) → Normalize → Enrich → Route → Transform → Transmit (T1 … T5)]
This is our data plane — it sends messages from multiple sources to multiple destinations.
Building StaaS
But, we also want to allow users the ability to define their own data pipes in this data plane.
Building StaaS
Hence, we need a management plane to capture the intent of the users.
[Diagram: Management Plane (Admin FE, Admin BE) sitting alongside the Data Plane]
We now have at least 2 planes : Management & Data.
Building StaaS
We also need a Provisioner (P) and a Deployer (D).
Finally, we can add systems to promote health and stability: an Observer (O) & an Autoscaler (A).
Together these 4 services form the Control Plane.
[Diagram: Management Plane → Control Plane (P, D, O, A) → Data Plane]
Building StaaS
The Control Plane Topology is a diamond-cross.
[Diagram: O, D, A, P arranged in a diamond-cross]
Building StaaS
• The Observer (O) is the source of truth for system health
• It is aware of D, P, and A activity & may quiet alarms during certain actions
• It can collect and monitor more complex health metrics than lag and loss. For example, in ML pipelines, it can track scoring skew
• The system can also detect common causes of non-recoverable failures & alert customers
Building StaaS
• The Deployer (D) deploys new code to the data plane
• It will not, however, deploy if the system is unstable or autoscaling
• It can also automatically roll back if the system becomes unstable due to a deployment
Building StaaS
• The Provisioner (P) deploys customer data pipes to the system
• It can pause if the system is unstable or autoscaling
• It can also control things like phased traffic ramp-ups for newly deployed pipelines
Conclusion
• We have built a Streams-as-a-Service system with many NFRs as first class citizens
• While we've covered many key elements, a few areas will be covered in future talks (e.g. Isolation, Containerization, Caching)
• Should you have questions, join me for Q&A and follow for more on (@r39132)
Thank You for your Time
And thanks to the many people who help build these systems with me:
• Vincent Chen
• Anisha Nainani
• Pramod Garre
• Harsh Bhimani
• Nirmalya Ghosh
• Yash Shah
• Aastha Sinha
• Esther Kent
• Dheeraj Rampali
• Deepak Chandramouli
• Prasanna Krishna
• Sandhu Santhakumar
• Maneesh CM & team at Active Lobby
• Shiju & the team at Xminds
• Bob Carlson
• Tony Gentille
• Josh Evans
YOW! Data Keynote (2021)
