Renegotiating the boundary between database latency and consistency
Presented by: Tzach Livyatan, VP of Product, ScyllaDB, and Konstantin Osipov, Director, Software Engineering, ScyllaDB
Moderated by: Srini Penchikala, InfoQ Editor
Tzach Livyatan
VP of Product, ScyllaDB
+ Leads the product team at ScyllaDB
+ Appreciates distributed system testing
+ Lives in Tel Aviv, father of two
Konstantin Osipov
Director of Engineering, ScyllaDB
+ Worked on consensus algorithms in ScyllaDB
+ Crazy about distributed system testing
+ Lives in Moscow, father of two
About ScyllaDB
+ For distributed, data-intensive apps that require high performance and low latency
+ 400+ users worldwide
+ Results:
  + Comcast: reduced P99 latencies by 95%
  + FireEye: 1500% improvement in throughput
  + Discord: reduced Cassandra nodes from ~140 to 6
  + iFood: 9x cost reduction vs. DynamoDB
+ Open Source, Enterprise, and Cloud options
+ Fully compatible with Apache Cassandra and Amazon DynamoDB

[Chart: "ScyllaDB Universe of 400+ Users", plotting users by latency (<1ms, 1ms, 10ms) and scale (1M, 10M)]
400+ Companies Use ScyllaDB
+ Seamless experiences across content + devices
+ Fast computation of flight pricing
+ Corporate fleet management
+ Real-time analytics
+ 2,000,000-SKU e-commerce management
+ Real-time location tracking for friends/family
+ Video recommendation management
+ IoT for industrial machines
+ Synchronize browser properties for millions
+ Threat intelligence service using JanusGraph
+ Real-time fraud detection across 6M transactions/day
+ Uber-scale, mission-critical chat & messaging app
+ Network security threat detection
+ Power ~50M X1 DVRs with billions of reqs/day
+ Precision healthcare via Edison AI
+ Inventory hub for retail operations
+ Property listings and updates
+ Unified ML feature store across the business
+ Cryptocurrency exchange app
+ Geography-based recommendations
+ Distributed storage for distributed ledger tech
+ Global operations: Avon, The Body Shop, and more
+ Predictable performance for on-sale surges
+ GPS-based exercise tracking
Agenda
■ Introduction to ScyllaDB
■ Consistency vs. Availability
■ Problem statement: Schema and Topology Consistency
■ Raft in ScyllaDB
■ Schema and Topology Consistency in ScyllaDB 5.x
■ Next steps
■ Q&A
A Brief History of Databases
+ 1970s: mainframes, inception of the relational model
+ 1980s: SQL and relational databases become the de facto standard
+ 1990s: LAN age; replication, external caching, ORMs
+ 2000s: Web 2.0; NoSQL databases for scale
+ 2010s: cloud age; commoditization of NoSQL, inception of NewSQL
Cloud infrastructure: The last ~10 years
+ 2008-2012: SSDs at $2500/TB; a typical instance had 4 cores
+ 2015-2022: SSDs at $100/TB (1000x faster, 10x cheaper); 96-core VMs (20x more cores); 100 Gbps NICs (100x more throughput); 2000-CPU-core systems and beyond
NoSQL – By Data Model (in order of increasing complexity)
+ Key/value: Redis, Aerospike, RocksDB
+ Document store: MongoDB, Couchbase
+ Wide-column store: Scylla, Apache Cassandra, HBase, DynamoDB
+ Graph: Neo4j, JanusGraph
NoSQL – By Availability vs. Consistency
[Diagram: the CAP triangle; pick two of Consistency, Availability, and Partition Tolerance]
Cluster Level Reads
[Animation: a cluster-level read across replicas, steps 1 through 4]
ScyllaDB Architecture – Eventually Consistent
+ Active/active, replicated, auto-sharded
+ Clients pick a consistency level (CL) per request, e.g. CL=LOCAL_QUORUM or CL=ONE

[Diagram: applications accessing the cluster at CL=LOCAL_QUORUM and CL=ONE]
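For context, a minimal sketch of the consistency-level math, assuming Cassandra-style replication with a replication factor (RF); this is illustrative, not ScyllaDB code:

    # Minimal sketch: how many replica acknowledgements a coordinator waits
    # for at each consistency level (CL). Illustrative only.
    def required_acks(cl: str, rf: int) -> int:
        """Replica acknowledgements required for a request at the given CL."""
        if cl == "ONE":
            return 1
        if cl in ("QUORUM", "LOCAL_QUORUM"):  # LOCAL_QUORUM: quorum in one DC
            return rf // 2 + 1
        if cl == "ALL":
            return rf
        raise ValueError(f"unknown consistency level: {cl}")

    # With RF=3: ONE needs 1 ack, QUORUM needs 2, ALL needs 3.
    # Reads are guaranteed to observe the latest write only when R + W > RF.
    for cl in ("ONE", "QUORUM", "ALL"):
        print(cl, required_acks(cl, rf=3))

CL=ONE minimizes latency but reads and writes need not overlap; quorum levels trade latency for overlap, which is exactly the boundary this talk renegotiates.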
The Problem with Metadata
Eventual Consistency
What is a Database Schema?

Replicating Schema Changes
[Diagram: a CREATE KEYSPACE clicks WITH { replication … } statement propagating through the cluster; nodes temporarily hold different schema versions (7, 7, 6, 6, 5)]
Consistency Model of Schema Changes
[Diagram: schema divergence over time during a network split]
+ Both nodes start with table (id, first, last), containing: 1 John Doe
+ Node A: adds email, then phone -> (id, first, last, email, phone), rows: 1 John Doe; 2 Jenny Smith, j@..., (867)
+ Node B: adds only phone -> (id, first, last, phone), rows: 1 John Doe; 2 Jenny Smith, (867)
+ Result: split brain; the two nodes disagree on the table's schema
(In)consistency of Schema Changes
cqlsh:test> create table t (a int primary key);
----------------------------------------------- split ------------------------------------------
cqlsh:test> alter table t rename a to d;
Warning: schema version mismatch detected
cqlsh:test> insert into t (d) values (1);
Cannot execute this query as it might involve data filtering and thus
may have unpredictable performance.
cqlsh:test> insert into t (a) values (1);
Unknown identifier a
Eventual Consistency of Topology Changes
What is Topology?
Topology is all of the following:
+ the set of nodes in the cluster,
+ the location of those nodes in DCs and racks,
+ and the assignment of data ownership to nodes.
Token Metadata
+ Members, data partitioning and distribution
+ Where does each key live in the cluster?
Token Partitioning
+ token = hash(partition key)
+ token ring: the space of all tokens, i.e. the set of all partition keys
+ token range: a contiguous slice of the ring, i.e. a set of partition keys

[Diagram: a token ring, with tokens marking the boundaries of token ranges]
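As a rough illustration of these definitions (ScyllaDB actually uses Murmur3 hashing and a much richer ring implementation), hashing a key to a token and locating its range could look like:

    # Minimal sketch of token partitioning. Illustrative only.
    import hashlib
    from bisect import bisect_left

    def token(partition_key: str) -> int:
        """token = hash(partition key), mapped onto a 64-bit ring."""
        digest = hashlib.md5(partition_key.encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def owning_token(tok: int, ring_tokens: list[int]) -> int:
        """End of the token range containing `tok` (wraps around the ring)."""
        i = bisect_left(ring_tokens, tok)
        return ring_tokens[i % len(ring_tokens)]

    ring_tokens = sorted(token(f"vnode-{i}") for i in range(12))
    print(owning_token(token("user:42"), ring_tokens))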
Token Metadata
[Diagram: nodes A, B, C each own several token ranges interleaved around the ring]
Token metadata:
+ Each node is assigned a set of tokens during bootstrap (vnodes)
+ Together, the tokens determine the primary owning replicas for key ranges
Token Metadata
[Diagram: the replication strategy maps token metadata (ranges owned by A, B, C) to replication metadata: replica sets such as {A, C}, {C, B}, {B, A}, {C, A}, {A, B}, {B, C}]
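A minimal sketch of how a replication strategy could turn token metadata into replication metadata, assuming SimpleStrategy-style placement with RF=2 (illustrative, not ScyllaDB's implementation):

    # Derive replica sets (replication metadata) from token metadata by
    # walking the ring clockwise. Illustrative only.
    def replicas_for_range(idx: int, ring: list[tuple[int, str]], rf: int) -> list[str]:
        """Owning node of ring[idx] plus the next rf-1 distinct nodes clockwise."""
        replicas: list[str] = []
        i = idx
        while len(replicas) < rf:
            node = ring[i % len(ring)][1]
            if node not in replicas:   # skip further vnodes of a chosen node
                replicas.append(node)
            i += 1
        return replicas

    # Interleaved vnodes of nodes A, B, C, as in the diagram above.
    ring = [(10, "A"), (20, "C"), (30, "B"), (40, "C"), (50, "A"), (60, "B")]
    for idx, (tok, _) in enumerate(ring):
        print(f"range ending at {tok}: {replicas_for_range(idx, ring, rf=2)}")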
Eventually (In)consistent Topology
+ To ensure data consistency, all coordinators need to agree on the topology
+ Eventually consistent propagation -> stale topology
[Animation over several slides: nodes A, B, C each hold their own copy of the token metadata]
+ The whole cluster goes down, then comes back up except node C
+ The token metadata in gossip on nodes A and B now omits node C's ranges, while node C's local (stale) view still includes them
+ Node D bootstraps from the token metadata in gossip, so its view also omits node C; node C and node D now hold different views of the ring

Consequences:
+ Different token metadata -> different replica sets
+ Different nodes use different quorums -> inconsistent reads
+ Writes temporarily go to the wrong replica set
+ etc.
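A toy illustration of why this matters, reusing the simplified ring model from the sketches above (not ScyllaDB code): two coordinators with different token metadata pick different replicas for the same key.

    # Coordinators with divergent token metadata compute different replicas
    # for the same key, so their quorums need not overlap. Illustrative only.
    def primary_replica(tok: int, ring: list[tuple[int, str]]) -> str:
        """First ring token at or after `tok`, wrapping around the ring."""
        for t, node in ring:
            if tok <= t:
                return node
        return ring[0][1]

    stale_view  = [(10, "A"), (20, "C"), (30, "B")]  # node C's local view
    gossip_view = [(10, "A"), (30, "B")]             # view that dropped node C

    key_token = 15
    print(primary_replica(key_token, stale_view))    # C, per node C's view
    print(primary_replica(key_token, gossip_view))   # B, per A's and B's view
    # The same key maps to different replicas depending on the coordinator,
    # so a quorum read may miss a quorum write.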
Eventually (In)consistent Topology
“Cannot” happen:
“Before adding the new node, check the node’s status in the cluster using nodetool status command. You cannot add new nodes to the cluster if any of the nodes are down.” [1]

[1] https://docs.scylladb.com/operating-scylla/procedures/cluster-management/add-node-to-cluster/
Strongly Consistent Topology
The plan:
+ Make the database responsible for consistency under all conditions
Why:
+ Gives a reliable safety net for admins
+ Reduces stress
+ Increases confidence
+ Simplifies procedures
What is Raft?
Raft Intro
Raft is a protocol for state machine replication. What does that mean?
+ A majority of nodes hold the same state
+ State transitions happen in the same order on all nodes
Cluster topology is part of that state.
How Raft Achieves Consistency
[Animation: nodes A, B, C each run a state machine fed from a replicated log (x←1, y←2, z←3); a consensus module on each node keeps the logs identical]
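As a minimal sketch of the idea (not ScyllaDB's Raft implementation): every node applies the same committed log entries in the same order, so every node reaches the same state. Consensus itself (leader election, replication, commit) is elided here.

    # State machine replication in miniature. Illustrative only.
    log = [("x", 1), ("y", 2), ("z", 3)]   # the replicated, committed log

    class StateMachine:
        def __init__(self):
            self.state: dict[str, int] = {}
        def apply(self, entry: tuple[str, int]) -> None:
            key, value = entry
            self.state[key] = value

    nodes = {name: StateMachine() for name in ("A", "B", "C")}
    for entry in log:                       # same entries, same order, each node
        for sm in nodes.values():
            sm.apply(entry)

    assert nodes["A"].state == nodes["B"].state == nodes["C"].state
    print(nodes["A"].state)                 # {'x': 1, 'y': 2, 'z': 3}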
Leader-Based Replication
[Diagram: a CREATE KEYSPACE clicks WITH { replication … } statement propagates from the leader, which is at schema version 7, to followers still at version 6]
Raft Leadership Changes
[Animation over time: an election starts; S1 becomes a candidate; more candidates appear; S1 is elected leader]
Raft Configuration Changes
[Diagram: configuration changes (add node D, del node A) are entries in the replicated log alongside normal commands (x←1, y←2, z←3)]
Non-voting Members
[Diagram: node A executes ADD NODE B; node B joins the group as a non-voting member]
How ScyllaDB uses Raft
Setting up a Fresh Cluster
On a fresh start, a ScyllaDB node:
+ Generates and persists a unique random Server ID (UUID)
+ Contacts all known peers, and strictly after:
  + contacting all peers in the seeds: list,
  + exchanging all known Server IDs,
  + AND not finding an existing cluster,
  + AND only if its Server ID is lexicographically the smallest,
+ Creates a new Raft Group ID and a new cluster
(A sketch of this procedure follows the diagram below.)
Setting up a Fresh Cluster
[Diagram over time: five nodes start; each contacts the peers it knows (e.g. nodes 2 and 3); the sets of known Server IDs merge until every node knows {1, 2, 3, 4, 5} and a single cluster forms]
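A minimal sketch of the discovery rule, with hypothetical helper names (not ScyllaDB code): each node merges the Server IDs its peers know about, and only the node with the smallest ID bootstraps the cluster.

    # Fresh-cluster discovery in miniature. Illustrative only.
    import uuid

    def discover(my_id: str, peer_ids: set[str], existing_cluster: bool) -> str:
        """Decide this node's role after contacting all known peers."""
        if existing_cluster:
            return "join existing cluster"
        all_ids = peer_ids | {my_id}
        if my_id == min(all_ids):        # smallest Server ID wins
            return "create new Raft Group ID and a new cluster"
        return "wait and join the cluster created by the smallest ID"

    my_id = str(uuid.uuid4())            # generated and persisted on first start
    peers = {str(uuid.uuid4()) for _ in range(4)}
    print(discover(my_id, peers, existing_cluster=False))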
Topology Changes on Raft
system.token_metadata
+ A Raft group (raft_group0) includes all cluster members
+ Token metadata is the state machine replicated by Raft
+ Changes to token metadata are Raft commands
Schema Changes on Raft
To execute a DDL statement, the server:
+ Takes a Raft read barrier
+ Reads the latest schema and validates the CQL
+ Builds a Raft command and signs it with the old and new schema IDs
+ Once the command is committed, it is applied only if the old schema ID still matches
+ Retries if the commit or apply failed
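A minimal sketch of this compare-and-swap-style flow, with hypothetical names (not ScyllaDB code): the command carries both schema IDs, and applying it is conditional on the old ID.

    # Conditional schema application on Raft, in miniature. Illustrative only.
    class SchemaStateMachine:
        def __init__(self, schema_id: str):
            self.schema_id = schema_id

        def apply(self, old_id: str, new_id: str, ddl: str) -> bool:
            if self.schema_id != old_id:  # a concurrent DDL won; reject
                return False
            self.schema_id = new_id       # ... and actually apply `ddl` here
            return True

    def execute_ddl(sm: SchemaStateMachine, ddl: str, new_id: str) -> None:
        while True:
            # Read barrier: observe the latest committed schema.
            old_id = sm.schema_id
            # Validate the CQL against old_id's schema, build the command,
            # commit it via Raft, then apply; retry from scratch on conflict.
            if sm.apply(old_id, new_id, ddl):
                return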
The Balance Between Consistency and Availability
Availability of DML
[Diagram: servers S1, S2, S3 share a Raft log of DDL (CREATE TABLE t; ADD COLUMN b; CREATE INDEX t_i1); DML statements (INSERT INTO t SET b = 2; SELECT b) trigger a schema fetch when the coordinator's schema lags behind]
Replacing Gossip with Raft
+ Raft eagerly replicates to every node
+ Like RF=ALL tables with auto-repair
+ Request coordinators still use the local view of the topology
+ No extra coordination when executing user requests
+ Topology changes use linearizable access for learning and modification
+ No need for sleep(30s)
+ Faster topology changes
Solved Issues
+ Concurrent DDL is now safe
+ Safe topology changes enable elasticity
+ Still under --experimental-features-raft
+ Enabled if all nodes are on 5.0
Split Brain Problem
[Diagram: applications on both sides of a network partition keep talking to their half of the cluster]
Introduced Issues
Raft prefers CONSISTENCY over AVAILABILITY. What does that mean?
+ Two-datacenter setups become more fragile
+ Prefer an odd number of DCs to avoid split brain
+ If the majority is permanently lost, import sstables into a new cluster
+ A 5.0 cluster with Raft can't downgrade to 4.x
Summary
Steps to Stronger Consistency in ScyllaDB
+ Tests, tests, and more tests
+ Schema consistency: experimental in 5.0
+ Topology consistency: coming in 5.x
+ Tablets consistency: coming in 5.x
Thank You! Do reach out!

Konstantin Osipov: @kostja_osipov, kostja@scylladb.com
Tzach Livyatan: @tzachl, tzach@scylladb.com
InfoQ: webinars@infoq.com

United States: 2445 Faber St, Suite #200, Palo Alto, CA, USA 94303
Israel: Maskit 4, Herzliya, Israel 4673304

www.scylladb.com
@scylladb
