@NSilnitsky
Migrating to Multi Cluster Managed Kafka
Migrating to a Multi-Cluster Managed Kafka with 0 Downtime
Natan Silnitsky, Backend Infra TL, Wix.com
natansil.com twitter@NSilnitsky linkedin/natansilnitsky github.com/natansil
Kafka in Wix (2019)
1 cluster, self-hosted
5K topics
> 45K partitions
~450M messages produced a day
Kafka in Wix Today (2021)
1 cluster, self-hosted
20K topics
> 200K partitions
2.5B messages produced a day
So, migrate all this
From: 1 overloaded, self-hosted cluster (2021) with 20K topics, > 200K partitions, 2.5B messages produced a day
To: a multi-cluster, managed Kafka platform
1. Better cluster performance & flexibility
2. Transparent version upgrades
3. Easy to add a new cluster
4. Tiered storage
Wix wraps Kafka with Greyhound, a Scala/Java high-level SDK.
~2000 Wix microservices use the Greyhound producer and consumer, which wrap the Kafka producer and consumer.
[Diagram: example with a Checkout Service and a Payments Service, connected through the Kafka broker by a Greyhound producer and a Greyhound consumer that wrap the Kafka producer and consumer]
Agenda
1. The Multi Cluster (Kafka)
2. The Migration
3. What to Expect
Single cluster overload
[Diagram: DC1 to DC4, each with a single overloaded cluster of Kafka brokers (intra-DC)]
From single cluster overload to multi cluster
[Diagram: DC1 to DC4, with Kafka clusters split by SLAs into clusters A, B, and C]
I want to produce a domain event. To which cluster?
[Diagram: DC1 to DC4 with Kafka clusters A, B, and C, split by SLAs]
The Multi Cluster (Kafka)
With Greyhound, the consumer spec and the producer each name their target cluster:
aGreyhoundConsumerSpec(groupName, messageHandler, topicName)
  .ClusterA
val producer = GreyhoundBuilder.resilientProducer(topicName,
  _.ClusterB
Cluster A + DC1, Cluster A + DC2: the same logical cluster name maps to a different physical cluster in each data center.
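A minimal sketch of how a logical cluster name plus a data center could resolve to a physical cluster. Everything here is an assumption for illustration: `ClusterResolver`, `DataCenter`, and the bootstrap addresses are hypothetical and are not Greyhound's actual API.

```scala
// Hypothetical names throughout: this is not Greyhound's actual API.
object ClusterResolver {
  sealed trait LogicalCluster
  case object ClusterA extends LogicalCluster
  case object ClusterB extends LogicalCluster

  final case class DataCenter(name: String)

  // Illustrative bootstrap addresses; real values would come from configuration.
  private val physical: Map[(LogicalCluster, DataCenter), String] = Map(
    (ClusterA, DataCenter("DC1")) -> "cluster-a.dc1.example:9092",
    (ClusterA, DataCenter("DC2")) -> "cluster-a.dc2.example:9092",
    (ClusterB, DataCenter("DC1")) -> "cluster-b.dc1.example:9092"
  )

  // Returns None when the logical cluster is not deployed in that data center.
  def bootstrapServers(cluster: LogicalCluster, dc: DataCenter): Option[String] =
    physical.get((cluster, dc))
}
```

Service code only ever names the logical cluster; the data center it runs in supplies the second half of the key.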
Agenda
1. The Multi Cluster (Kafka)
2. The Migration
3. What to Expect
Our Starting Point
→ Unbalanced brokers
→ Unclear Kafka strategy
→ Too many partitions
→ Real production impact
From: single cluster (overloaded, self-hosted)
To: multi cluster (managed, optimized)
Migrations we ❤
Migrate on 0 traffic (drained data center)?
Migrate on 0 traffic? (Blockers)
→ Services tied to a specific DC
→ Takes a long time
→ Not gradual: risk of edge cases
Q4 2020: CANCELED
Migrate with traffic!
… HAS to be seamless & production-safe.
Automate, automate, automate.
Option B: The Migration
[Diagram: in Data Center 1, an HTTP service uses Greyhound to talk to the self-hosted Kafka cluster; Confluent Cloud is the migration target]
[Diagram: a Replicator service is added between the self-hosted Kafka cluster in Data Center 1 and Confluent Cloud]
The Replicator service:
1. Consume from the self-hosted Kafka cluster (Data Center 1)
2. Produce to Confluent Cloud
3. Save the offset mapping
[Diagram: Topic 1, one partition. Offsets 54 to 64 in the self-hosted cluster correspond to offsets 0 to 10 in Confluent Cloud]
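The offset mapping saved in step 3 can be sketched as a sorted map per topic partition. This is an illustration under assumptions, not the Replicator's real implementation: `OffsetMapping`, `save`, and `translate` are hypothetical names, and the fallback for gaps (take the next replicated record) is one possible policy.

```scala
import scala.collection.immutable.TreeMap

// Hypothetical sketch: source offset -> target offset for one topic partition.
object OffsetMapping {
  type Mapping = TreeMap[Long, Long]

  val empty: Mapping = TreeMap.empty[Long, Long]

  // Step 3: record where a replicated record landed in the target cluster.
  def save(mapping: Mapping, sourceOffset: Long, targetOffset: Long): Mapping =
    mapping.updated(sourceOffset, targetOffset)

  // Translate a committed source offset (the next record to read) into a target offset.
  // If that exact source offset was never replicated, fall forward to the next one.
  // None means nothing at or after that offset has been replicated yet.
  def translate(mapping: Mapping, committedSourceOffset: Long): Option[Long] = {
    val it = mapping.iteratorFrom(committedSourceOffset)
    if (it.hasNext) Some(it.next()._2) else None
  }
}
```

With the offsets from the diagram (54 to 64 mapped to 0 to 10), a consumer committed at source offset 57 would resume at target offset 3.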
[Diagram: a Migration Orchestrator is added alongside Greyhound, the self-hosted Kafka cluster, and the Replicator service in Data Center 1, with Confluent as the target]
Migrate Consumer, step 1: for a group of consumers, the Migration Orchestrator has the Replicator service replicate to Confluent.
Migrate Consumer:
1. Replicate to Confluent
2. Unsubscribe from the self-hosted Kafka cluster
3. Subscribe to Confluent
4. Rollback (seek offset) on failure
[Diagram: the Migration Orchestrator sends Unsubscribe and Subscribe requests to Greyhound in Data Center 1 and listens to events that report success or failure]
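The consumer migration steps above can be sketched as a small function with an explicit rollback path. The `Ops` trait and every method on it are hypothetical stand-ins for the orchestrator's calls to Greyhound and the Replicator; the real services are event-driven and are not shown in the slides.

```scala
object ConsumerMigration {
  // Hypothetical operations; each returns Left(reason) on failure.
  trait Ops {
    def replicate(topic: String): Either[String, Unit]
    // Returns the last committed offset on the self-hosted cluster.
    def unsubscribeFromSelfHosted(group: String): Either[String, Long]
    def subscribeToConfluent(group: String, sourceOffset: Long): Either[String, Unit]
    // Re-subscribe to the self-hosted cluster and seek to the saved offset.
    def rollback(group: String, sourceOffset: Long): Unit
  }

  sealed trait Outcome
  case object Migrated extends Outcome
  final case class Aborted(reason: String) extends Outcome
  final case class RolledBack(reason: String) extends Outcome

  def migrate(ops: Ops, topic: String, group: String): Outcome =
    ops.replicate(topic) match {
      case Left(err) => Aborted(err) // nothing changed yet for the consumer
      case Right(_) =>
        ops.unsubscribeFromSelfHosted(group) match {
          case Left(err) => Aborted(err) // assumed: consumer is still subscribed
          case Right(offset) =>
            ops.subscribeToConfluent(group, offset) match {
              case Right(_) => Migrated
              case Left(err) =>
                ops.rollback(group, offset)
                RolledBack(err)
            }
        }
    }
}
```

Keeping the committed offset from the unsubscribe step is what makes the rollback a simple seek.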
The Migration: Best Practices
1. Create a script that checks state by itself and stops if the expected state is not reached.
2. Have a rollback readily available.
3. Start with test topics and no-impact topics.
4. Create custom metrics dashboards that show the current state.
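Best practice 1 could look roughly like the following: each step carries its own state check, and the run stops at the first step whose expected state is not reached. `GuardedSteps` and `Step` are hypothetical names for this sketch, not the actual Wix scripts.

```scala
object GuardedSteps {
  // One migration step: an action plus a check that the expected state was reached.
  final case class Step(name: String, run: () => Unit, reachedExpectedState: () => Boolean)

  // Runs steps in order. Stops at the first step whose check fails and
  // returns its name in Left; later steps are never executed.
  def runAll(steps: List[Step]): Either[String, Unit] =
    steps match {
      case Nil => Right(())
      case step :: rest =>
        step.run()
        if (step.reachedExpectedState()) runAll(rest) else Left(step.name)
    }
}
```

The returned step name tells the operator exactly where to resume or roll back from.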
Agenda
1. The Multi Cluster (Kafka)
2. The Migration
3. What to Expect
What to Expect when Migrating to Multi Cluster
During Nov/Dec 2021
[Diagram: the Replicator service copies Topic 1, Topic 2, ... Topic N from the on-prem Kafka cluster to the managed Kafka cluster]
“Unexpected error from SyncGroup: The server experienced an unexpected error when processing the request.”
kafka-configs.sh --bootstrap-server localhost:6667 \
  --entity-type brokers --entity-default --alter \
  --add-config message.max.bytes=4194304
It’s Christmas Eve 🎄 2021, and the SyncGroup error is back.
kafka-configs.sh --bootstrap-server localhost:6667 \
  --entity-type brokers --entity-default --alter \
  --add-config message.max.bytes=8388608
Kafka records start getting DELETED faster than expected (for compact topics too)
Luckily for us ... we restored the records from another data center.
Affected broker versions: 1.1.0, 2.0.1, 2.1.1, 2.2.2, 2.4.0, 2.3.1
Fix: changed a dummy value for all topic configs
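One way to catch this kind of surprise earlier is to compare the topic-level configs you expect with what the brokers report after a dynamic config change. The slides do not say how the deletion was detected, so this is a hypothetical guard: `TopicConfigDrift` is an invented name, and fetching the actual configs from the cluster is left out.

```scala
object TopicConfigDrift {
  // topic -> (config key -> value), e.g. "retention.ms" or "cleanup.policy"
  type TopicConfigs = Map[String, Map[String, String]]

  final case class Drift(topic: String, key: String, expected: String, actual: Option[String])

  // Lists every expected topic-level override that is missing or different in `actual`.
  def drift(expected: TopicConfigs, actual: TopicConfigs): List[Drift] =
    for {
      (topic, configs) <- expected.toList
      (key, value)     <- configs.toList
      found = actual.get(topic).flatMap(_.get(key))
      if !found.contains(value)
    } yield Drift(topic, key, value, found)
}
```

An empty result after each broker config change is the expected state; anything else is a reason to stop and roll back.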
Sharded consumers ...
[Diagram: the Replicator service's consumers are sharded (A, B) across Topic 1, Topic 2, ... Topic N, replicating from the self-hosted Kafka cluster to the managed Kafka cluster]
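A minimal sketch of sharding, assuming it means splitting the replicated topics between separate consumer groups (A, B, ...). The slides do not show how Wix assigns topics to shards; the hash-based `shardOf` below is just one deterministic option, and `ReplicatorSharding` is a hypothetical name.

```scala
object ReplicatorSharding {
  // Assigns a topic to one of `shards` consumer groups, deterministically.
  def shardOf(topic: String, shards: Int): Int = {
    require(shards > 0, "shards must be positive")
    Math.floorMod(topic.hashCode, shards) // floorMod keeps the result non-negative
  }

  // Groups topics by shard index; every topic lands in exactly one shard.
  def assign(topics: Seq[String], shards: Int): Map[Int, Seq[String]] =
    topics.groupBy(topic => shardOf(topic, shards))
}
```

Each shard subscribes to fewer topics, so each consumer group carries less group metadata.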
We have the infra in place to control it from the outside!
So we will soon be able to …
1. Switch cluster
2. Skip messages
3. Change processing rate
Without GA!
[Diagram: the Migration Orchestrator controls Greyhound from the outside (for example, Unsubscribe), alongside the self-hosted Kafka cluster, the Replicator service, and Confluent]
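The three planned controls can be pictured as commands applied to a consumer's runtime state. This is a sketch only: `ConsumerControl`, `ConsumerState`, and the command names are hypothetical, and it assumes "without GA" means without rolling out a new version of the service.

```scala
object ConsumerControl {
  // Hypothetical runtime state of one consumer, changed from the outside.
  final case class ConsumerState(cluster: String, offset: Long, maxRecordsPerSecond: Int)

  sealed trait Command
  final case class SwitchCluster(target: String) extends Command
  final case class SkipMessages(count: Long) extends Command
  final case class ChangeProcessingRate(maxRecordsPerSecond: Int) extends Command

  def applyCommand(state: ConsumerState, command: Command): ConsumerState =
    command match {
      // Simplification: a real switch would also translate the offset between clusters.
      case SwitchCluster(target)      => state.copy(cluster = target)
      case SkipMessages(count)        => state.copy(offset = state.offset + count)
      case ChangeProcessingRate(rate) => state.copy(maxRecordsPerSecond = rate)
    }
}
```

Modeling the controls as data means an orchestrator can send, log, and replay them without touching service code.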
We used Greyhound & dedicated orchestration services for an automatic, safe, and gradual migration.
The blog post: https://medium.com/wix-engineering/migrating-to-a-multi-cluster-managed-kafka-with-0-downtime-b936655f888e
Greyhound: a Scala/Java high-level SDK for Apache Kafka.
github.com/wix/greyhound
0.2 is out!
Thank You!
natansil.com twitter@NSilnitsky linkedin/natansilnitsky github.com/natansil
👉 slideshare.net/NatanSilnitsky
Any questions?

Migrating to Multi Cluster Managed Kafka - Conf42 - CloudNative