Five Lessons in Distributed Databases
Jonathan Ellis
CTO, DataStax
1. If it’s not SQL, it’s not a database
A brief history of NoSQL
● Early 2000s: people hit limits on vertical scaling, start
sharding RDBMSes
● 2006, 2007: BigTable, Dynamo papers
● 2008-2010: Explosion of scale-out systems
○ Voldemort, Riak, Dynomite, FoundationDB, CouchDB
○ Cassandra, HBase, MongoDB
One small problem
Cassandra’s experience
● Thrift RPC “drivers” too low level
● Fragmented: Hector, Pelops, Astyanax
● Inconsistent across language ecosystems
Solution: CQL
● 2011: Cassandra 0.8 introduces CQL 1.0
● 2012: Cassandra 1.1 introduces CQL 3.0
● 2013: Cassandra 1.2 adds collections
Today
● Cassandra: CQL
● CosmosDB: “SQL”
● Cloud Spanner: “SQL”
● Couchbase: N1QL
● HBase: Phoenix SQL (Java only)
● DynamoDB: REST/JSON
● MongoDB: BSON
2. It takes 5+ years to build a database
Curt Monash
Rule 1: Developing a good DBMS requires 5-7 years and
tens of millions of dollars.
That’s if things go extremely well.
Rule 2: You aren’t an exception to Rule 1.
Aside: Mistakes I made starting DataStax
● Stayed at Rackspace too long
● Raised a $2.5M series A
● Waited a year to get serious about enterprise sales
● Changed the company name
● Brisk
Examples (Curt)
● Concurrent workloads benchmarked in the lab are poor
predictors of concurrent performance in real life.
● Mixed workload management is harder than you’re
assuming it is.
● Those minor edge cases in which your Version 1
product works poorly aren’t minor after all.
Examples (Cassandra)
● Hinted handoff
● Repair
● Counters
● Paxos
● Test suite
Aside: Fallout (Jepsen at Scale)
● Ensemble - the set of clusters brought up and torn down for each test
○ Server cluster - Cassandra/DSE
○ Client cluster - load generators
○ Observer cluster - records live information from the clusters (OpsCenter/Graphite)
○ Controller - Fallout itself
● Workload - the guts of the test
○ Phases - run sequentially; each phase contains one or more modules that run in parallel
○ Checkers - run after all phases and verify the data emitted by the modules
○ Artifact checkers - run against collected artifacts to look for correctness problems
A simple Fallout workload
ensemble:
  server:
    node.count: 3
    provisioner:
      name: local
    configuration_manager:
      name: ccm
      properties:
        cassandra.version: 3.0.0
  client: server  # use the server cluster
phases:
  - insert_workload:
      module: stress
      properties:
        iterations: 1m
        type: write
        rf: 3
    gossip_updown:
      module: nodetool
      properties:
        command: disablegossip
        secondary.command: enablegossip
        sleep.seconds: 10
        sleep.randomize: 20
  - read_workload:
      module: stress
      properties:
        iterations: 1m
        type: read
checkers:
  verify_success:
    checker: nofail

1. Start a 3-node ccm cluster.
2. Insert data while bringing gossip on the nodes up and down.
3. Read and check the data.
4. Verify none of the steps failed.
Note: to move from ccm to EC2, we only need to change the ensemble section.
5-7 years?
● Cassandra became Apache TLP in Feb 2010
● 3.0 released Fall 2015
● OSS is about adoption, not saving time/money
3. The customer is always right
Example: sequential scans
SELECT * FROM user_purchases
WHERE purchase_date > 2000
What’s wrong with this query?
For 100,000 purchases, nothing.
For 100,000,000 purchases, you’ll crash the server
(in 2012).
Solution (2012): ALLOW FILTERING
SELECT * FROM user_purchases
WHERE purchase_date > 2000
ALLOW FILTERING
Better solution (2013): Paging
● Build the resultset incrementally and “page” it to the client (sketch below)
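From the client's point of view, this is roughly what paging looks like with the DataStax Python driver, reusing the user_purchases query from the earlier slide (the keyspace name is made up for illustration):

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("shop")  # hypothetical keyspace

query = SimpleStatement(
    "SELECT * FROM user_purchases WHERE purchase_date > 2000 ALLOW FILTERING",
    fetch_size=1000,  # server returns at most 1,000 rows per page
)

# Iterating fetches the next page on demand instead of materializing the
# whole resultset on the coordinator.
for row in session.execute(query):
    print(row)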
Example: tombstones
INSERT INTO foo (id, …) VALUES (1254, …)
DELETE FROM foo WHERE id = 1254
…
SELECT * FROM foo
Solution (2013)
tombstone_warn_threshold: 1000
tombstone_failure_threshold: 100000
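Roughly what those two cassandra.yaml settings govern on the read path, as a simplified Python illustration (this is not Cassandra's actual code, and the cell structure is invented for the sketch): a query still has to read and skip every tombstone it touches, so the server warns past the first threshold and aborts the read past the second.

import logging

TOMBSTONE_WARN_THRESHOLD = 1_000
TOMBSTONE_FAILURE_THRESHOLD = 100_000

class TombstoneOverwhelmingError(Exception):
    """The read touched too many tombstones and was aborted."""

def scan_partition(cells):
    live_rows, tombstones = [], 0
    for cell in cells:
        if cell.is_tombstone:  # deleted data must still be read and skipped
            tombstones += 1
            if tombstones > TOMBSTONE_FAILURE_THRESHOLD:
                raise TombstoneOverwhelmingError(f"scanned {tombstones} tombstones")
        else:
            live_rows.append(cell)
    if tombstones > TOMBSTONE_WARN_THRESHOLD:
        logging.warning("query scanned %d tombstones", tombstones)
    return live_rows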
Better Solution (???): It’s complicated
● Track repair status to get rid of GCGS (gc_grace_seconds)
● Bring time-to-repair from “days” to “hours”
● Optional: improve time-to-compaction
Example: joins
● CQL doesn’t support joins
● People still use client-side joins instead of
denormalizing
Solution (2015-???): MV (materialized views)
● Make it easier to denormalize (example below)
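A minimal sketch of what that looks like via the Python driver, assuming a user_purchases base table keyed by (user_id, purchase_date); the keyspace, view name, and column types are made up for illustration:

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("shop")  # hypothetical keyspace

# Cassandra maintains the denormalized copy on every write to the base table,
# instead of the application doing a client-side join or manual dual writes.
session.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS purchases_by_date AS
        SELECT * FROM user_purchases
        WHERE purchase_date IS NOT NULL AND user_id IS NOT NULL
        PRIMARY KEY (purchase_date, user_id)
""")

rows = session.execute("SELECT * FROM purchases_by_date WHERE purchase_date = 2017")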
Better solution (???): actually add joins
● Less controversial: shared partition joins
● More controversial: cross-partition joins
● CosmosDB, Spanner
A note on configurability
4. Too much magic is a bad thing
Not (just) about vendors overpromising
● “Our database isn’t subject to the limits of the CAP
theorem”
● “Our queue can guarantee exactly once delivery”
● “We’ll give you 99.99% uptime*”
Magic can be bad even when it works
Cloud Spanner analysis excerpt
Spanner’s architecture implies that writes will be significantly slower
than reads due to the need to coordinate across multiple replicas and
avoid overlapping time bounds, and that is what we see in the original
2012 Spanner paper.
… Besides write performance in isolation, because Spanner uses
pessimistic locking to achieve ACID, reads are locked out of rows
(partitions?) that are in the process of being updated. Thus, write
performance challenges can spread to causing problems with reads as
well.
Cloud Spanner
Auto-scaling in DynamoDB
● Request capacity is tied to internal “partitions” [pp]
○ pp count = max(read capacity / 3,000, write capacity / 1,000, storage / 10 GB)
● Subtle implication: capacity per pp decreases as storage volume increases (see the sketch below)
○ Non-uniform: per-pp request capacity is halved when a partition splits
● Subtle implication 2: bulk loads will wreck your capacity planning
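A rough sketch of that formula in Python (a simplification: real DynamoDB splits partitions by doubling, so actual counts can differ), showing how a fixed provisioned throughput gets diluted as storage grows:

import math

RCU_PER_PARTITION = 3000   # published 2017-era per-partition limits
WCU_PER_PARTITION = 1000
GB_PER_PARTITION = 10

def partition_count(read_capacity, write_capacity, storage_gb):
    return max(
        math.ceil(read_capacity / RCU_PER_PARTITION),
        math.ceil(write_capacity / WCU_PER_PARTITION),
        math.ceil(storage_gb / GB_PER_PARTITION),
    )

# Same table provisioned at 3,000 RCU / 1,000 WCU, with growing data:
for storage_gb in (5, 50, 500):
    pp = partition_count(3000, 1000, storage_gb)
    print(f"{storage_gb} GB -> {pp} partitions, {1000 / pp:.0f} WCU per partition")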
“Best practices for tables”
● Bulk load 200M items = 200 GB
● Target 60 minutes = 55,000 write capacity = 55 pp (arithmetic reproduced below)
● Post-bulk-load steady state:
● 1,000 req/s = 2 req/pp = 2 req/(3.6M items)
● No way to reduce the partition count afterwards
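The bulk-load arithmetic above, reproduced as a sketch (using the slide's rounded figures; real DynamoDB splits partitions by doubling, so the true count can be somewhat higher):

items = 200_000_000
storage_gb = 200           # ~1 KB per item
write_capacity = 55_000    # enough WCU to load 200M ~1 KB items in about an hour

partitions = max(write_capacity / 1_000,  # 55 partitions from throughput
                 storage_gb / 10)         # 20 partitions from storage

print(f"{partitions:.0f} partitions, {items / partitions / 1e6:.1f}M items per partition")
# -> 55 partitions, 3.6M items per partition
# The partition count never shrinks, so whatever capacity is provisioned after
# the load is spread across all 55 partitions from then on.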
Ravelin, 2017
You construct a table which uses a customer ID as partition key. You
know your customer ID’s are unique and should be uniformly
distributed across nodes. Your business has millions of customers and
no single customer can do so many actions so quickly that the
individual could create a hot key. Under this key you are storing around
2KB of data.
This sounds reasonable.
This will not work at scale in DynamoDb.
How much magic is too much?
● Joins: Apparently okay
● Auto-scaling: Apparently also okay
● Automatic partitioning: not okay
● Really slow ACID: not okay (?)
● Why?
● How do we make the system more transparent without
inflicting an unnecessary level of detail on the user?
5. It’s the cloud, stupid
September 2011
March 2012
March 2012
March 2012
The cloud is here. Now what?
Cloud-first architecture
“The second trend will be the increased
prevalence of shared-disk distributed
DBMS. By “shared-disk” I mean a DBMS
that uses a distributed storage layer as its
primary storage location, such as HDFS or
Amazon’s EBS/S3 services. This
separates the DBMS’s storage layer from
its execution nodes. Contrast this with a
shared-nothing DBMS architecture where
each execution node maintains its own
storage.”
Cloud-first infrastructure
● What on-premises infrastructure can provide a
cloud-like experience?
● Kubernetes?
● OpenStack?
Cloud-first development
● Is a yearly (bi-yearly?) release process the right
cadence for companies building cloud services?
Cloud-first OSS
● What does OSS look like when you don’t work for the
big three clouds?
● “Commons Clause” is an attempt to deal with this
○ (What about AGPL?)
Summary
1. If it’s not SQL, it’s not a database.
2. It takes 5+ years to build a database.
3. Listen to your users.
4. Too much magic is a bad thing.
5. It’s the cloud, stupid.
