Using Apache Cassandra and Apache Kafka to Scale Next Gen Applications

Adam Zegelin
Founding Software Engineer, Instaclustr

Introduction
• Adam Zegelin
• Co-founded Instaclustr 5 years ago
• In Canberra, Australia
• Current focus is Cassandra on Kubernetes
• Instaclustr
• Managed Apache Cassandra, Spark and Kafka in the ☁️
○ AWS, GCP, Azure & IBM
○ 3000 nodes under management
○ 24×7×365 support
• Consulting
○ Schema & application design
○ Workshops & Training
• 2nd-level on-call support for on-premise deployments

Agenda
• Introduction to Cassandra and Kafka
• Real-world Use Cases
• Worldpay
• Lendi
• Instaclustr
• Partitioning: the key to scale
• Fitting and architecting for your use case

• Linearly Scalable
• Always Available
• Multi-Region Data Store
• Apache Cassandra is the leading NoSQL operational database for high-scale and high-reliability applications.
• Shared-nothing, peer-to-peer architecture provides reliability up to 100% (with Instaclustr SLAs).
• Replicated data and multiple nodes capable of fulfilling queries
○ Node outage? Service just keeps running
• Full online maintenance and in-place upgrades
• Low latency for operational applications
• Sub-10ms P95 reads and writes achievable
• Native active-active multi data center support
• Geographic distribution (to meet latency requirements)
• Disaster resilience
• Workload isolation (analytics)
• Cassandra is a data storage system, not an analytics/query engine or a place to run logic

Typical Use Cases
• High write to read ratio
• Data is rarely updated
• Including explicit deletes
• The Primary Key is known at read time
• Limited filtering & aggregation
• No JOINs or referential integrity
• Transaction logging
• Time series data
• IoT status and event history
• Health tracker data
• Order & package statuses & tracking
• Weather service history
• Messages and email envelopes
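
The "primary key is known at read time" point can be illustrated with a small in-memory sketch of a time-series table. This is not Cassandra code: the class, the function names and the choice of a (device, day) partition key are assumptions made for illustration only. It shows why writes are cheap appends and why a read needs the full partition key.

```python
from collections import defaultdict
from datetime import datetime, timezone


def partition_key(device_id, ts):
    """Hypothetical partition key: one partition per device per UTC day."""
    return (device_id, ts.astimezone(timezone.utc).strftime("%Y-%m-%d"))


class TimeSeriesTable:
    """In-memory stand-in for a table partitioned by (device, day)."""

    def __init__(self):
        self._partitions = defaultdict(list)

    def write(self, device_id, ts, status):
        # A write is an append to exactly one partition; nothing is updated.
        self._partitions[partition_key(device_id, ts)].append((ts, status))

    def read_day(self, device_id, day):
        # A read must supply the whole partition key; rows come back in time order.
        rows = self._partitions.get((device_id, day), [])
        return sorted(rows, key=lambda row: row[0])
```

Bucketing by day is one common way to keep a single partition from growing without bound; the right bucket size depends on the write rate.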

Queuing, Pub/Sub and Streaming at Scale
• Apache Kafka is a distributed streaming platform
• Publish and subscribe to streams of records
○ Similar to a message queue or enterprise messaging system (EMS)
• Store streams of records
○ Fault-tolerant
○ Durable
• Process streams of records
○ as they occur
○ randomly, from any position in the stream
• Replicated architecture
• High-level similarities to Cassandra
• Scalability
• Reliability
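
The publish / store / process points above can be pictured as an append-only log addressed by offset. The sketch below is an in-memory illustration only; TopicLog and its methods are hypothetical names, not part of any Kafka client API.

```python
class TopicLog:
    """Toy append-only log: records are kept, and readers choose their offset."""

    def __init__(self):
        self._records = []

    def publish(self, record):
        # Appending never modifies earlier records; the offset is the position.
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset, max_records=10):
        # Reading does not remove anything, so any consumer can start
        # from any position, including re-reading old records.
        return self._records[offset:offset + max_records]
```

Because reading does not delete, two consumers can process the same records independently, which is the main difference from a classic queue.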

Typical Use Cases
• As a message bus
• Loose coupling between producers and consumers
• Basis for micro-services
• As a commit log
• A store of logical transactions
• Populating analytical data stores or edge caches
• As a buffer
• Manage backpressure & workload spikes
And when combined with Kafka Streams/Spark Streaming…
• As the basis of a streaming architecture
• (near) real-time analytics
• Data processing pipelines

Typical Use Cases
cont’d
• Website activity tracking
• Page views
• Searches
• Other user actions
• Metrics
• Operational monitoring data
• Log aggregation
• Centralized logging
• Event sourcing
• Application state changes
• “we don't just want to see where we are, we also want to know
how we got there”
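
Event sourcing can be sketched as "state is the result of replaying the event log". The account-style events and function names below are assumptions chosen for illustration; they do not come from the talk.

```python
def apply_event(balance, event):
    """Apply one (kind, amount) event to the current balance."""
    kind, amount = event
    if kind == "deposited":
        return balance + amount
    if kind == "withdrawn":
        return balance - amount
    raise ValueError(f"unknown event kind: {kind}")


def replay(events, upto=None):
    """Rebuild state from the first `upto` events (all events if None)."""
    balance = 0
    for event in events[:upto]:
        balance = apply_event(balance, event)
    return balance
```

Replaying only a prefix of the log gives the state at an earlier point in time, which is the "how we got there" part.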

Case study: Worldpay
• Payment processor
• spun out of RBS in 2010
• merged with Vantiv in US in Jan 2018 for USD 10.4B to form Worldpay, Inc.
• Processes
• >40 Million transactions per day
• for 400,000 merchants
• 42% of all UK non-cash transactions

Case study
cont’d
• Re-architecting of WorldPay’s XML Payment API
• facilitates ~40M transactions per month
• New architecture based on open source technologies
• including Cassandra and Kafka
• to provide scalability, availability and reduced costs
• New Idempotency Service
• first project to use the new architecture
• provides capabilities to ensure payments are not repeated
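
The slides do not describe how the Idempotency Service is implemented. As a generic illustration of the idea only, the sketch below records the result of each request under a client-supplied key and returns the stored result on a retry. All names are hypothetical, and the in-memory dict stands in for a durable store.

```python
class IdempotencyStore:
    """Toy idempotency check: run a handler at most once per key."""

    def __init__(self):
        self._results = {}

    def process_once(self, key, handler):
        """Return (result, executed). A retry with the same key is not re-run."""
        if key in self._results:
            return self._results[key], False
        result = handler()
        # Assumption: in a real service this check-and-store step must be
        # atomic across servers; a plain dict is only safe in one thread.
        self._results[key] = result
        return result, True
```

In a distributed store the check-and-store step has to be a single atomic operation; in Cassandra, a conditional insert (INSERT ... IF NOT EXISTS) is one mechanism that offers this.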

Case study
cont’d
• Challenges
• Tight deployment timeframe
• Very high availability expectations
• Low latency requirements
• Utilises Cassandra to provide highest levels of availability
and scalability
• 18 node cluster
• 3 AWS regions (in Europe)
• Leverages Cassandra's tuneable consistency
○ QUORUM = strong consistency across regions
○ still able to operate with a whole region unavailable
○ Latency is tolerable (restricted to EU)
• Simple data model with atomic reads/writes
○ fits well with Cassandra's capabilities
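
Why QUORUM survives the loss of a whole region can be checked with a little arithmetic. The slides do not state the replication factor, so the sketch assumes 3 replicas in each of the 3 regions; the function names are illustrative.

```python
def quorum(total_replicas):
    """QUORUM is a majority of all replicas: floor(n / 2) + 1."""
    return total_replicas // 2 + 1


def can_serve_quorum(regions, rf_per_region, regions_down):
    """True if the surviving replicas still form a majority."""
    total = regions * rf_per_region
    surviving = (regions - regions_down) * rf_per_region
    return surviving >= quorum(total)
```

Under that assumption there are 9 replicas and QUORUM needs 5. One region down leaves 6 replicas, which is enough; two regions down leaves 3, which is not.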

Case study
cont’d
• Worked with Instaclustr to accelerate development and
time to stable service:
• Consulting engagement assisted with data model design
• Cassandra cluster run on Instaclustr managed service
○ production ready in weeks
• Initial preference was to run on-prem
• security compliance
• did not expect cloud to meet latency requirements
• However, timeframes did not allow establishment of
internal deployment
• Used Instaclustr’s managed Cassandra service on AWS for
initial go-live.
• Now satisfied as a long-term solution

Case study: Lendi
• Australia’s leading online home loan lender
• Processing over 90% of Australia’s online lending enquiries.
• Re-architecture of their platform following a major
funding round
• customer and data-centric

Case study
cont’d
• Integration-heavy environment
• Bespoke interfaces with banks, etc.
• Moving to a micro-services architecture
• Kafka as a message bus
• New architecture
• Decoupled application code from embedded data sets from
various business applications
• Unified data models from the various point solutions and
market segments
• Enabled extensive scale
○ supports rapid and large growth in data as the consumer base grows

Case study: Instaclustr
• Cassandra
• Storage for monitoring metrics & events
• Custom collector
• RabbitMQ transport
○ Will eventually move to Kafka as the transport
• Metrics are processed by Riemann
○ Raises PagerDuty alerts, tickets, emails
○ Writes to Cassandra
• Kafka
• Centralised logging
• Events are collected by Fluentd
• Pumped into Logstash via Kafka
• Indexed via Elasticsearch
• Viewed with Kibana

Partitioning
The key to scale
• Partitioning
• using a key in your data to split the data across multiple
servers
• Manual partitioning is possible but painful
• Cassandra and Kafka make partitioning transparent
• but the choice of key still needs conscious consideration
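
A minimal sketch of "using a key in your data to split the data across multiple servers": hash the key, then map the hash to a server. This is a simplification. The modulo mapping and the use of MD5 are assumptions made to keep the example short; Cassandra itself places data on a token ring, so adding a node moves far less data than plain modulo would.

```python
import hashlib


def node_for(key, nodes):
    """Pick the node that owns `key` using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Use the first 8 bytes as an integer so the result is stable across runs.
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]
```

The same key always lands on the same node, which is why reads need the key, and why a badly chosen key (for example one with very few distinct values) concentrates load on a few nodes.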

Cassandra Cluster
Cluster
Data Center (optional)
Rack (optional, recommended)
Node

Partitioning

Cassandra Partitions
Queuing and Streaming at Scale

● Broker
○ Node/server/VM
● Topic
○ Logical grouping of data (category/feed/name)
○ Settings:
■ Replication
■ Partition count
■ Retention
■ Compaction
■ …
Kafka Brokers, Topics and Partitions

● Partition
○ Subset of messages in a topic
■ Has a single leader broker
■ Guarantees ordered delivery within that subset
○ Number of partitions is set on topic creation
Kafka Topics and Partitions (cont’d)

• Messages are mapped to a partition by the Producer
• Randomly/round-robin
• Hash of record key
• Consumers are members of Consumer Groups
• Consumer Groups register to consume records from Topics
• Each Consumer in a Consumer Group is the exclusive consumer of a “fair share” of partitions in the topic.
Kafka Partitions in Action
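
The two mappings on this slide (record to partition, partition to consumer) can be sketched as follows. This is not the Kafka client API: the class and function names are hypothetical, CRC32 stands in for the hash the real client uses, and the round-robin split is just one possible "fair share" strategy.

```python
import itertools
import zlib


class Producer:
    """Toy producer-side partitioner."""

    def __init__(self, partition_count):
        self.partition_count = partition_count
        self._round_robin = itertools.cycle(range(partition_count))

    def partition_for(self, key=None):
        if key is None:
            # No key: spread records evenly across partitions.
            return next(self._round_robin)
        # Keyed: the same key always maps to the same partition,
        # so records for that key stay in order.
        return zlib.crc32(key.encode("utf-8")) % self.partition_count


def assign(partition_count, consumers):
    """Give each consumer in a group an exclusive share of the partitions."""
    ordered = sorted(consumers)
    return {
        consumer: list(range(index, partition_count, len(ordered)))
        for index, consumer in enumerate(ordered)
    }
```

With 6 partitions and 4 consumers, two consumers get two partitions each and two get one; a consumer beyond the partition count would get none, so the partition count caps the useful size of a group.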

Fitting and architecting for your use case
Cassandra
• Big data
• one or more individually big (>1TB) tables
• Need to pre-determine read pattern
• at least to partition key
• Very low cost writes
• great for high write / read ratio use cases
• Ideal for small reads
• 1, 10, 100, 1000 rows at a time
• No limits to horizontal scaling (data size or ops/sec)
• provided you can find a partition key that fits.
• No relational integrity
• No Foreign Keys, no JOINs
• Limited filtering, aggregation

Fitting and architecting for your use case
Kafka
• Big data
• 5k+ messages per topic per second
• Not transactional
• unlike traditional MQ tech
• although exactly-once delivery is now available
• Kafka Streams is a very powerful tool for analysis and mutation of data streams

Adam Zegelin
adam@instaclustr.com
Founding Software Engineer
