Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and Lens of Observability
S Vinod Kumar
Event Streaming Platform Team
Fidelity Investments
Fishing for Errors
Increase Linger
First step of solving a problem is to identify the problem
Throughput
Debugging
Kafka
Horizontal Broker Scaling
Memory
RAM
CPU
This Time It's Happening!!!???
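The "Increase Linger" and "Throughput" notes on this slide can be illustrated with a toy model. The sketch below is not the Kafka producer's real batching logic; it assumes a simplified rule in which a batch is sent once the linger time since its first record has passed or once it holds `max_records` records. The function name, the arrival pattern, and `max_records` are assumptions made for illustration; in the real client the related settings are `linger.ms` and `batch.size`.

```python
from typing import List


def count_batches(arrival_ms: List[float], linger_ms: float, max_records: int) -> int:
    """Count batches sent under a simplified linger model (illustrative only)."""
    batches = 0
    batch_start = None  # arrival time of the first record in the open batch
    batch_len = 0
    for t in arrival_ms:
        # Close the open batch if its linger window expired before this record.
        if batch_start is not None and t - batch_start > linger_ms:
            batches += 1
            batch_start, batch_len = None, 0
        if batch_start is None:
            batch_start = t
        batch_len += 1
        # Close the batch early when it is full.
        if batch_len >= max_records:
            batches += 1
            batch_start, batch_len = None, 0
    # Flush whatever is still open at the end of the run.
    if batch_len > 0:
        batches += 1
    return batches


# Hypothetical workload: one record per millisecond for 100 ms.
arrivals = [float(i) for i in range(100)]
no_linger = count_batches(arrivals, linger_ms=0, max_records=50)
with_linger = count_batches(arrivals, linger_ms=10, max_records=50)
print(no_linger, with_linger)
```

Under this model a longer linger trades a little latency for far fewer, larger requests, which is the usual motivation for raising it when throughput is the bottleneck.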
Client-Side Observability
Enhanced Monitoring on Kafka
Kafka Health Summary
System Metrics
Latency Metrics
Throughput Metrics
Producers, Consumers
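As a rough illustration of the latency and throughput metrics named on this slide, the sketch below summarizes client-side send samples. The `SendSample` type, its field names, and the nearest-rank percentile are assumptions chosen for the example; the slides do not say how the dashboards compute these values.

```python
import math
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SendSample:
    """One produced record as seen by the client (hypothetical shape)."""
    sent_ms: float
    acked_ms: float
    size_bytes: int


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples: List[SendSample]) -> Dict[str, float]:
    """Return p50/p99 send latency and average throughput for the window."""
    latencies = [s.acked_ms - s.sent_ms for s in samples]
    window_ms = max(s.acked_ms for s in samples) - min(s.sent_ms for s in samples)
    total_bytes = sum(s.size_bytes for s in samples)
    throughput = total_bytes / (window_ms / 1000) if window_ms > 0 else 0.0
    return {
        "p50_latency_ms": percentile(latencies, 50),
        "p99_latency_ms": percentile(latencies, 99),
        "throughput_bytes_per_s": throughput,
    }


samples = [
    SendSample(0, 5, 1000),
    SendSample(10, 20, 1000),
    SendSample(20, 120, 1000),
    SendSample(900, 1000, 1000),
]
summary = summarize(samples)
print(summary)
```

The same summary could be computed separately for producers and consumers, which matches the split shown on the slide.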
Streaming Platform – Resiliency Test Framework
Producers
Consumers
Breakdown
Normality
Self-healing
Recovery
Prod Release
Optimize
Analyze
Test
App Team / Developers
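The phases named on this slide (Normality, Breakdown, Self-healing, Recovery) can be sketched as labels over a throughput series recorded during a test. The rules below are assumptions, not the framework's definitions: a sample counts as healthy when it is within `tolerance` of a baseline, as self-healing when it is still degraded but rising, and as recovery on the first healthy sample after degradation.

```python
from typing import List


def label_phases(throughput: List[float], baseline: float, tolerance: float = 0.1) -> List[str]:
    """Label each throughput sample with an assumed resiliency phase."""
    floor = baseline * (1 - tolerance)
    labels: List[str] = []
    prev_value = None
    for value in throughput:
        degraded_before = bool(labels) and labels[-1] in ("breakdown", "self-healing")
        if value >= floor:
            # First healthy sample after degradation marks recovery.
            labels.append("recovery" if degraded_before else "normality")
        elif degraded_before and value > prev_value:
            # Still below the floor, but climbing back.
            labels.append("self-healing")
        else:
            labels.append("breakdown")
        prev_value = value
    return labels


# Hypothetical run: a fault is injected after the second sample.
series = [100, 98, 40, 35, 60, 85, 95, 100]
phases = label_phases(series, baseline=100)
print(phases)
```

Counting the samples between the first "breakdown" and the "recovery" label gives a simple time-to-recover figure that the Test, Analyze, Optimize loop could track across releases.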
Chaos Mesh / AWS FIS
Scheduled Chaos
Grey Failures
CPU/Memory/IO Stress
Network Latency & Jitter
Single Broker Failure
Single AZ Down
Hard Failures
Network Outage
Large Network Latency
Multi Broker Down
Region Down
[Diagram, shown twice: Kafka Broker boxes with Kafka, Replicator, Schema, and Connect labels spread across AZ1, AZ2, and AZ3]
Apps Tolerate
Apps Fail-Over to DR
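For the "Network Latency & Jitter" scenario, a Chaos Mesh experiment is typically described as a `NetworkChaos` resource. The sketch below builds such a manifest as a Python dict and serializes it to JSON, which Kubernetes accepts alongside YAML. The experiment name, the `kafka` namespace, the `app: kafka-broker` label, and the latency, jitter, and duration values are all assumptions for illustration, not values from the talk.

```python
import json
from typing import Dict


def network_delay_manifest(
    name: str,
    namespace: str,
    pod_labels: Dict[str, str],
    latency: str,
    jitter: str,
    duration: str,
) -> dict:
    """Build a Chaos Mesh NetworkChaos manifest that delays traffic on one pod."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "NetworkChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "delay",
            "mode": "one",  # pick a single matching pod
            "selector": {
                "namespaces": [namespace],
                "labelSelectors": pod_labels,
            },
            "delay": {"latency": latency, "jitter": jitter},
            "duration": duration,
        },
    }


# Hypothetical target: one broker pod in a "kafka" namespace.
manifest = network_delay_manifest(
    name="broker-latency-jitter",
    namespace="kafka",
    pod_labels={"app": "kafka-broker"},
    latency="200ms",
    jitter="50ms",
    duration="5m",
)
manifest_json = json.dumps(manifest, indent=2)
print(manifest_json)
```

Generating manifests from code rather than hand-editing them makes it easier to run the same scenario on a schedule with different intensities, which fits the "Scheduled Chaos" idea on the slide.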
Streaming Platform – Resiliency Test Framework
Kafka Client Applications
Optimize Quotas and re-evaluate
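One way to picture the "Optimize Quotas and re-evaluate" step is a small helper that proposes a byte-rate quota per client from the peak throughput observed during a test run. The headroom factor, the minimum quota, and the client names are assumptions; the slides do not describe how quotas are actually derived. The resulting numbers would feed a setting such as Kafka's `producer_byte_rate` quota.

```python
from typing import Dict


def propose_quotas(
    peak_bytes_per_s: Dict[str, float],
    headroom: float = 1.5,
    minimum: int = 1_048_576,
) -> Dict[str, int]:
    """Propose a byte-rate quota per client: observed peak times headroom, with a floor."""
    return {
        client: max(minimum, int(peak * headroom))
        for client, peak in peak_bytes_per_s.items()
    }


# Hypothetical peaks measured during a resiliency test.
observed = {"orders-producer": 4_000_000.0, "audit-producer": 200_000.0}
quotas = propose_quotas(observed)
print(quotas)
```

Re-running the same calculation after each test cycle is the "re-evaluate" half of the loop: a client whose peak keeps approaching its quota is a candidate for a higher limit or for tuning.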
Thank You!
Kafka Summit London 2024
S Vinod Kumar
/s-vinod-kumar
FIND YOUR FIDELITY
