Using Control Theory to Keep
Compactions Under Control
Glauber Costa - VP of Field Engineering, ScyllaDB
WEBINAR
Glauber Costa
Glauber Costa is the VP of Field Engineering at ScyllaDB.
He shares his time between the engineering department,
working on upcoming Scylla features, and helping
customers succeed.
Before ScyllaDB, Glauber worked on virtualization in the
Linux kernel for 10 years, with contributions ranging from
the Xen hypervisor to all sorts of guest functionality and
containers.
About ScyllaDB
+ Next-generation NoSQL database
+ Drop-in replacement for Cassandra
+ 10X the performance & low tail latency
+ Open source and enterprise editions
+ Founded by the creators of the KVM hypervisor
+ HQs: Palo Alto, CA; Herzelia, Israel
+ Scylla Summit 2018: November 6-7, SF Bay
Join real-time big-data database developers and users from start-ups
and leading enterprises from around the globe for two days of sharing
ideas, hearing innovative use cases, and getting practical tips and tricks
from your peers and NoSQL gurus.
What are compactions?

[Diagram: Scylla's write path — incoming writes are recorded in the commit log and flushed to SSTables on disk, which are then merged by compaction.]
Compaction Strategy
+ Which SSTables to compact, and when?
+ This is called the compaction strategy
+ The goal of the strategy is low amplification:
+ Avoid read requests needing many SSTables: read amplification
+ Avoid overwritten/deleted/expired data staying on disk, and avoid excessive temporary disk space needs: space amplification
+ Avoid compacting the same data again and again: write amplification
The main compaction strategies
+ Size Tiered Compaction Strategy
+ compacts SSTables of roughly the same size together
+ Leveled Compaction Strategy
+ keeps SSTables in levels that are exponentially bigger
+ Time Window Compaction Strategy
+ each user-defined time window has a single SSTable
+ Major, or manual, compaction
+ compacts everything into a single* SSTable
* see next slide
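The size-tiered rule above ("compact SSTables of roughly the same size together") can be sketched as a simple bucketing pass. This is a minimal illustration, not Scylla's actual algorithm; the 0.5x-1.5x bucket bounds and the threshold of 4 SSTables are assumptions for the example:

```python
def size_tiered_buckets(sstable_sizes, low=0.5, high=1.5, min_threshold=4):
    """Group SSTables of roughly equal size into compaction buckets.

    A bucket accepts an SSTable whose size is within [low, high] times
    the bucket's running average size. Buckets with at least
    min_threshold members become compaction candidates.
    """
    buckets = []  # each entry: [average_size, [member sizes]]
    for size in sorted(sstable_sizes):
        for bucket in buckets:
            avg, members = bucket
            if low * avg <= size <= high * avg:
                members.append(size)
                bucket[0] = sum(members) / len(members)
                break
        else:
            buckets.append([size, [size]])
    return [members for avg, members in buckets
            if len(members) >= min_threshold]

# Four 1 GB SSTables and four 4 GB SSTables form two candidate buckets;
# two lone SSTables of very different sizes form none.
print(size_tiered_buckets([1, 1, 1, 1, 4, 4, 4, 4]))
print(size_tiered_buckets([1, 10]))
```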
Compactions in Scylla
+ Because all data is sharded, so are SSTables
+ and as a result, so are compactions
+ in a system with 64 vCPUs, expect 64 SSTables after a major compaction
+ the same logic applies to LeveledCompactionStrategy for the number of SSTables in each level.
Impact of compactions
+ Compaction too slow: reads will touch many SSTables and be slower.
+ Compaction too fast: the foreground workload will be disrupted.
+ A common solution is to use limits. Ex: Apache Cassandra
+ “Don’t allow compactions to run at more than 300 MB/s”
+ But how do you find that number?
+ And what if the workload changes?
+ And what if there is idle time now?
+ Another solution is to use ratios. Ex: ScyllaDB until 2.2
+ “Don’t allow compactions to use more than 20% of storage bandwidth/CPU”
+ Much better: it adapts automatically to resource capacity and uses idle time efficiently
+ But it has no temporal knowledge.
Compactions over time

[Chart: compactions run with limited, but still visible, impact.]

[Chart: at one point all shards are compacting; at another, almost no shards are.]
What is Control Theory?
+ Open-loop control system
+ there is some input, a function is applied, there is an output.
+ ex: a toaster
+ Closed-loop control system
+ We want the world to be in a particular state.
+ The current state of the world is fed back to the control system
+ The control system acts to bring the system back to the goal
Feedback Control Systems
1. Measure the state of the world
2. Transfer function
3. Actuator
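The three-step loop above can be sketched generically. This is a toy proportional controller; all names, the setpoint, and the gain are chosen purely for illustration:

```python
def control_step(measure, transfer, actuate):
    """One iteration of a closed feedback loop."""
    state = measure()         # 1. measure the state of the world
    signal = transfer(state)  # 2. transfer function: state -> control signal
    actuate(signal)           # 3. actuator applies the signal

# Toy usage: drive a value toward a setpoint with a proportional gain.
world = {"value": 0.0}
SETPOINT, GAIN = 10.0, 0.5

def measure():
    return world["value"]

def transfer(value):
    return GAIN * (SETPOINT - value)  # signal proportional to the error

def actuate(signal):
    world["value"] += signal

for _ in range(20):
    control_step(measure, transfer, actuate)
# world["value"] has converged close to the 10.0 setpoint
```

Each iteration halves the remaining error, so the loop settles at the setpoint rather than oscillating or drifting.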
Measuring - current state of all SSTables

[Diagram: the size components tracked per SSTable — partial new SSTable size, static SSTable size, SSTable uncompacted size, and partially compacted SSTable size.]
Actuators - Schedulers

[Diagram: the Query, Commitlog, and Compaction classes each have their own queue into the userspace I/O scheduler, which sits in front of storage. The scheduler admits requests up to the maximum useful disk concurrency, so no queues build up in the filesystem or device.]
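The role of the per-class queues can be illustrated with a toy shares-based dispatcher. This is stride scheduling in miniature, not Scylla's actual I/O scheduler; the class names and share values are made up for the example:

```python
import heapq

def dispatch(queues, shares, n):
    """Dispatch n requests across classes in proportion to their shares.

    Each dispatch advances the class's virtual time by 1/shares; the
    class with the lowest virtual time goes next (stride scheduling).
    """
    heap = [(0.0, name) for name in sorted(queues)]
    heapq.heapify(heap)
    order = []
    for _ in range(n):
        vtime, name = heapq.heappop(heap)
        order.append(queues[name].pop(0))
        heapq.heappush(heap, (vtime + 1.0 / shares[name], name))
    return order

# Toy usage: queries get 3x the shares of compaction, so over 8
# dispatches queries get roughly three slots for every compaction slot.
queues = {"query": ["q"] * 8, "compaction": ["c"] * 8}
order = dispatch(queues, {"query": 3, "compaction": 1}, 8)
```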
Transfer Function - Backlog
+ Each compaction strategy does a different amount of work
+ For each compaction strategy we determine when there is no more work to be done.
+ Examples:
+ SizeTiered: there is only one SSTable in the system.
+ TimeWindow: there is only one SSTable per time window.
+ The backlog B is: how many bytes do we expect to write before reaching zero backlog?
+ Controller output: f(B), a proportional function
+ This is a self-regulating system:
+ more compaction shares = fewer new writes = less compaction backlog
+ fewer compaction shares = more new writes = more compaction backlog
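The proportional controller f(B) and the self-regulation it produces can be seen in a toy simulation; all the constants here (max shares, write rate, drain rate) are invented for illustration, not Scylla's tuning:

```python
def compaction_shares(backlog, max_backlog, max_shares=1000):
    """Proportional transfer function: shares grow linearly with backlog B."""
    return min(max_shares, max_shares * backlog / max_backlog)

# Toy simulation of the self-regulating loop: each tick, writes add
# backlog and compaction drains it at a rate proportional to its shares.
backlog, MAX_BACKLOG = 0.0, 100.0
for tick in range(200):
    backlog += 2.0                                  # new writes create backlog
    shares = compaction_shares(backlog, MAX_BACKLOG)
    backlog -= min(backlog, shares / 1000 * 5.0)    # more shares -> faster drain
# backlog settles at a stable equilibrium instead of growing without bound
```

Because the drain rate rises with the backlog, the system finds the point where compaction exactly keeps up with incoming writes, with no hand-tuned MB/s cap.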
SizeTiered Backlog example
SizeTiered Backlog
+ each byte that is written now is rewritten T times, where T is the number of tiers
+ In SizeTiered, tiers are proportional to SSTable sizes.
+ The number of tiers is roughly proportional to the log of the SSTable's contribution to the total size
+ Ex: 4 SSTables of 1 GB, 4 SSTables of 4 GB. Total size = 20 GB
+ log4(20 / 1) ~ 2
+ log4(20 / 4) ~ 1
+ The backlog for one SSTable is its size times the backlog per byte:
+ B = SSTableSize * log4(TableSize / SSTableSize)
+ The backlog for the entire table is the sum of the backlogs of all its SSTables.
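The formula above translates directly into code; a straight transcription of B = SSTableSize * log4(TableSize / SSTableSize), summed over the table's SSTables:

```python
import math

def sstable_backlog(sstable_size, table_size):
    """Backlog for one SSTable: its size times its expected number of
    rewrites, estimated as log4 of its share of the total table size."""
    return sstable_size * math.log(table_size / sstable_size, 4)

def table_backlog(sstable_sizes):
    """Backlog for the whole table: the sum of its SSTables' backlogs."""
    total = sum(sstable_sizes)
    return sum(sstable_backlog(size, total) for size in sstable_sizes)

# The slide's example: 4 SSTables of 1 GB and 4 of 4 GB, 20 GB in total.
sizes_gb = [1, 1, 1, 1, 4, 4, 4, 4]
# sstable_backlog(1, 20) uses log4(20/1) ~ 2.16
# sstable_backlog(4, 20) uses log4(20/4) ~ 1.16
```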
Results: before vs after

[Chart]

Results: throughput vs CPU

[Chart: throughput plotted against % CPU time used by compactions.]

Results: changing workload

[Chart: when the workload changes, the controller adjusts automatically and reaches a new equilibrium.]

Results: impact on latency
+ 2 ms 99.9% latencies at 100% load
+ < 2 ms 99% latencies
+ 1 ms 95% latencies
Q&A
Stay in touch
Join us at Scylla Summit 2018
Pullman San Francisco Bay Hotel | November 6-7
scylladb.com/scylla-summit-2018
glauber@scylladb.com
@ScyllaDB
@glcst
United States
1900 Embarcadero Road
Palo Alto, CA 94303
Israel
11 Galgalei Haplada
Herzelia, Israel
www.scylladb.com
@scylladb
Thank You!
