Five Lessons in Distributed Databases
Jonathan Ellis
CTO, DataStax
1. If it’s not SQL, it’s not a database
A brief history of NoSQL
● Early 2000s: people hit limits on vertical scaling, start
sharding RDBMSes
● 2006, 2007: BigTable, Dynamo papers
● 2008-2010: Explosion of scale-out systems
○ Voldemort, Riak, Dynomite, FoundationDB, CouchDB
○ Cassandra, HBase, MongoDB
One small problem
Cassandra’s experience
● Thrift RPC “drivers” too low level
● Fragmented: Hector, Pelops, Astyanax
● Inconsistent across language ecosystems
Solution: CQL
● 2011: Cassandra 0.8 introduces CQL 1.0
● 2012: Cassandra 1.1 introduces CQL 3.0
● 2013: Cassandra 1.2 adds collections
Today
● Cassandra: CQL
● CosmosDB: “SQL”
● Cloud Spanner: “SQL”
● Couchbase: N1QL
● HBase: Phoenix SQL (Java only)
● DynamoDB: REST/JSON
● MongoDB: BSON
2. It takes 5+ years to build a database
Curt Monash
Rule 1: Developing a good DBMS requires 5-7 years and
tens of millions of dollars.
That’s if things go extremely well.
Rule 2: You aren’t an exception to Rule 1.
Aside: Mistakes I made starting DataStax
● Stayed at Rackspace too long
● Raised a $2.5M series A
● Waited a year to get serious about enterprise sales
● Changed the company name
● Brisk
Examples (Curt)
● Concurrent workloads benchmarked in the lab are poor
predictors of concurrent performance in real life.
● Mixed workload management is harder than you’re
assuming it is.
● Those minor edge cases in which your Version 1
product works poorly aren’t minor after all.
Examples (Cassandra)
● Hinted handoff
● Repair
● Counters
● Paxos
● Test suite
Aside: Fallout (Jepsen at Scale)
● Ensemble - the set of clusters brought up and torn down for each test
○ Server cluster - Cassandra/DSE
○ Client cluster - load generators
○ Observer cluster - records live information from the clusters (OpsCenter/Graphite)
○ Controller - Fallout itself
● Workload - the guts of the test
○ Phases - run sequentially; each phase contains one or more modules that run in parallel
○ Checkers - run after all phases and verify the data emitted by the modules
○ Artifact checkers - run against collected artifacts to look for correctness problems
A simple Fallout workload
ensemble:
  server:
    node.count: 3
    provisioner:
      name: local
    configuration_manager:
      name: ccm
      properties:
        cassandra.version: 3.0.0
  client: server  # use the server cluster
phases:
  - insert_workload:
      module: stress
      properties:
        iterations: 1m
        type: write
        rf: 3
    gossip_updown:
      module: nodetool
      properties:
        command: disablegossip
        secondary.command: enablegossip
        sleep.seconds: 10
        sleep.randomize: 20
  - read_workload:
      module: stress
      properties:
        iterations: 1m
        type: read
checkers:
  verify_success:
    checker: nofail

1. Start a 3-node ccm cluster.
2. Insert data while bringing gossip on the nodes up and down.
3. Read and check the data.
4. Verify none of the steps failed.
Note: to move from ccm to EC2, we only need to change the ensemble section.
5-7 years?
● Cassandra became Apache TLP in Feb 2010
● 3.0 released Fall 2015
● OSS is about adoption, not saving time/money
3. The customer is always right
Example: sequential scans
SELECT * FROM user_purchases
WHERE purchase_date > 2000
What’s wrong with this query?
For 100,000 purchases, nothing.
For 100,000,000 purchases, you’ll crash the server
(in 2012).
Solution (2012): ALLOW FILTERING
SELECT * FROM user_purchases
WHERE purchase_date > 2000
ALLOW FILTERING
Better solution (2013): Paging
● Build the resultset incrementally and “page” it to the client (sketch below)
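From the client's point of view, this is roughly what paging looks like with the DataStax Python driver, reusing the user_purchases query from the earlier slide (the keyspace name is made up for illustration):

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("shop")  # hypothetical keyspace

query = SimpleStatement(
    "SELECT * FROM user_purchases WHERE purchase_date > 2000 ALLOW FILTERING",
    fetch_size=1000,  # server returns at most 1,000 rows per page
)

# Iterating fetches the next page on demand instead of materializing the
# whole resultset on the coordinator.
for row in session.execute(query):
    print(row)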
Example: tombstones
INSERT INTO foo (id, …) VALUES (1254, …)
DELETE FROM foo WHERE id = 1254
…
SELECT * FROM foo
Solution (2013)
tombstone_warn_threshold: 1000
tombstone_failure_threshold: 100000
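Roughly what those two cassandra.yaml settings govern on the read path, as a simplified Python illustration (this is not Cassandra's actual code, and the cell structure is invented for the sketch): a query still has to read and skip every tombstone it touches, so the server warns past the first threshold and aborts the read past the second.

import logging

TOMBSTONE_WARN_THRESHOLD = 1_000
TOMBSTONE_FAILURE_THRESHOLD = 100_000

class TombstoneOverwhelmingError(Exception):
    """The read touched too many tombstones and was aborted."""

def scan_partition(cells):
    live_rows, tombstones = [], 0
    for cell in cells:
        if cell.is_tombstone:  # deleted data must still be read and skipped
            tombstones += 1
            if tombstones > TOMBSTONE_FAILURE_THRESHOLD:
                raise TombstoneOverwhelmingError(f"scanned {tombstones} tombstones")
        else:
            live_rows.append(cell)
    if tombstones > TOMBSTONE_WARN_THRESHOLD:
        logging.warning("query scanned %d tombstones", tombstones)
    return live_rows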
Better Solution (???): It’s complicated
● Track repair status to get rid of GCGS (gc_grace_seconds)
● Bring time-to-repair from “days” to “hours”
● Optional: improve time-to-compaction
Example: joins
● CQL doesn’t support joins
● People still use client-side joins instead of
denormalizing
Solution (2015-???): MV (materialized views)
● Make it easier to denormalize (example below)
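A minimal sketch of what that looks like via the Python driver, assuming a user_purchases base table keyed by (user_id, purchase_date); the keyspace, view name, and column types are made up for illustration:

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("shop")  # hypothetical keyspace

# Cassandra maintains the denormalized copy on every write to the base table,
# instead of the application doing a client-side join or manual dual writes.
session.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS purchases_by_date AS
        SELECT * FROM user_purchases
        WHERE purchase_date IS NOT NULL AND user_id IS NOT NULL
        PRIMARY KEY (purchase_date, user_id)
""")

rows = session.execute("SELECT * FROM purchases_by_date WHERE purchase_date = 2017")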
Better solution (???): actually add joins
● Less controversial: shared partition joins
● More controversial: cross-partition joins
● CosmosDB, Spanner
A note on configurability
4. Too much magic is a bad thing
Not (just) about vendors overpromising
● “Our database isn’t subject to the limits of the CAP
theorem”
● “Our queue can guarantee exactly once delivery”
● “We’ll give you 99.99% uptime*”
Magic can be bad even when it works
Cloud Spanner analysis excerpt
Spanner’s architecture implies that writes will be significantly slower
than reads due to the need to coordinate across multiple replicas and
avoid overlapping time bounds, and that is what we see in the original
2012 Spanner paper.
… Besides write performance in isolation, because Spanner uses
pessimistic locking to achieve ACID, reads are locked out of rows
(partitions?) that are in the process of being updated. Thus, write
performance challenges can spread to causing problems with reads as
well.
Cloud Spanner
Auto-scaling in DynamoDB
● Request capacity is tied to internal “partitions” [pp]
○ pp count = max(read capacity / 3,000, write capacity / 1,000, storage / 10 GB)
● Subtle implication: capacity per pp decreases as storage volume increases (see the sketch below)
○ Non-uniform: per-pp request capacity is halved when a partition splits
● Subtle implication 2: bulk loads will wreck your capacity planning
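A rough sketch of that formula in Python (a simplification: real DynamoDB splits partitions by doubling, so actual counts can differ), showing how a fixed provisioned throughput gets diluted as storage grows:

import math

RCU_PER_PARTITION = 3000   # published 2017-era per-partition limits
WCU_PER_PARTITION = 1000
GB_PER_PARTITION = 10

def partition_count(read_capacity, write_capacity, storage_gb):
    return max(
        math.ceil(read_capacity / RCU_PER_PARTITION),
        math.ceil(write_capacity / WCU_PER_PARTITION),
        math.ceil(storage_gb / GB_PER_PARTITION),
    )

# Same table provisioned at 3,000 RCU / 1,000 WCU, with growing data:
for storage_gb in (5, 50, 500):
    pp = partition_count(3000, 1000, storage_gb)
    print(f"{storage_gb} GB -> {pp} partitions, {1000 / pp:.0f} WCU per partition")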
“Best practices for tables”
● Bulk load 200M items = 200 GB
● Target 60 minutes = 55,000 write capacity = 55 pp (arithmetic reproduced below)
● Post-bulk-load steady state:
● 1,000 req/s = 2 req/pp = 2 req/(3.6M items)
● No way to reduce the partition count afterwards
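The bulk-load arithmetic above, reproduced as a sketch (using the slide's rounded figures; real DynamoDB splits partitions by doubling, so the true count can be somewhat higher):

items = 200_000_000
storage_gb = 200           # ~1 KB per item
write_capacity = 55_000    # enough WCU to load 200M ~1 KB items in about an hour

partitions = max(write_capacity / 1_000,  # 55 partitions from throughput
                 storage_gb / 10)         # 20 partitions from storage

print(f"{partitions:.0f} partitions, {items / partitions / 1e6:.1f}M items per partition")
# -> 55 partitions, 3.6M items per partition
# The partition count never shrinks, so whatever capacity is provisioned after
# the load is spread across all 55 partitions from then on.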
Ravelin, 2017
You construct a table which uses a customer ID as partition key. You
know your customer ID’s are unique and should be uniformly
distributed across nodes. Your business has millions of customers and
no single customer can do so many actions so quickly that the
individual could create a hot key. Under this key you are storing around
2KB of data.
This sounds reasonable.
This will not work at scale in DynamoDb.
How much magic is too much?
● Joins: Apparently okay
● Auto-scaling: Apparently also okay
● Automatic partitioning: not okay
● Really slow ACID: not okay (?)
● Why?
● How do we make the system more transparent without
inflicting an unnecessary level of detail on the user?
5. It’s the cloud, stupid
September 2011
March 2012
March 2012
March 2012
The cloud is here. Now what?
Cloud-first architecture
“The second trend will be the increased
prevalence of shared-disk distributed
DBMS. By “shared-disk” I mean a DBMS
that uses a distributed storage layer as its
primary storage location, such as HDFS or
Amazon’s EBS/S3 services. This
separates the DBMS’s storage layer from
its execution nodes. Contrast this with a
shared-nothing DBMS architecture where
each execution node maintains its own
storage.”
Cloud-first infrastructure
● What on-premises infrastructure can provide a
cloud-like experience?
● Kubernetes?
● OpenStack?
Cloud-first development
● Is a yearly (bi-yearly?) release process the right
cadence for companies building cloud services?
Cloud-first OSS
● What does OSS look like when you don’t work for the
big three clouds?
● “Commons Clause” is an attempt to deal with this
○ (What about AGPL?)
Summary
1. If it’s not SQL, it’s not a database.
2. It takes 5+ years to build a database.
3. Listen to your users.
4. Too much magic is a bad thing.
5. It’s the cloud, stupid.
