Couchbase usage and performance
Criteo - Couchbase Live 2016 - Paris
About me
Pierre Mavro - Lead DevOps - NoSQL Team
Working at Criteo as Site Reliability Engineer
@deimosfr
Criteo
31 Offices
2000+ employees
Criteo technical insights
● 700 engineers
● 17K servers
● 27K displays per second
● 2.4M requests per second
Criteo SRE: biggest challenges
● Scaling
● Low latency
● High throughput
● Resiliency
● Automation
Couchbase figures at Criteo (Worldwide)
● 1300+ physical servers
● 100+ clusters (up to 50 servers each)
● 90 TB of data in memory
● 25M QPS
● < 8 ms constant latency
Couchbase usage at Criteo
● Storing UUIDs (< 30 B)
● Storing blobs (e.g. binary images)
● Key size sometimes larger than value size
● Serving between 100 Kqps and 2.5 Mqps per cluster
● Low latency: < 2 ms at the 99th percentile
● Data size per cluster between 500 GB and ~12 TB (with replicas)
● All data fits in memory
● Inter-datacenter replication (custom client driver)
What we wanted to solve
Legacy infrastructure
● Couchbase v1.8 legacy (80%) and v3.0.1 Community (20%)
● Slow rebalance (up to 48 h for one server)
● Rebalance failures on highly loaded clusters
● Max connections reached on v1.8 (9k limit)
Legacy infrastructure
● Persisted and non-persisted buckets shared the same clusters
● No dedicated latency monitoring tool
● No automatic restart/upgrade orchestrator
● Server benchmarks needed updating
● Lack of Couchbase best practices
How we achieved the change
1. Benchmarks
Benchmarks
● Couchbase Enterprise 3.1.3
● 3x HP DL360 Gen9 (256 GB RAM, 6x 400 GB SSD in RAID 10, 1 Gb network interface): 2 injectors + 1 server
● Key size: UUID string (36 bytes) + Couchbase metadata (56 bytes)
● Value size: uniform range between 750 B and 1250 B (avg 1 kB)
● Number of items: 50M/node (with replica) or 100M/node (without replica)
● Resident active items (= items fully in RAM): ~50%
● Value-only ejection mode (only the data value can be ejected from RAM; key + metadata stay in RAM)
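As a rough memory budget implied by this setup: at 100M items per node, keys plus metadata alone pin about 100M × (36 + 56) B ≈ 9.2 GB of RAM, and the ~50% resident 1 kB values add roughly another 50 GB, comfortably within the 256 GB per server.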
Benchmarks
Heavy Writes / Light Reads (reads fixed at 10 Kqps), without replica

| Write rate per node | Status | Disk write queue | Latency p50 | Latency p95 | Latency p99 | Latency p99.9 |
|---------------------|--------|------------------|-------------|-------------|-------------|---------------|
| 40 Kset/s           | OK     | 10M items        | 0.4 ms      | 0.7 ms      | 2 ms        | 8 ms          |
| 60 Kset/s           | OK     | 30M items        | 0.4 ms      | 0.7 ms      | 2 ms        | 20 ms         |
| 80 Kset/s           | OK     | 50M items        | 0.4 ms      | 2 ms        | 7 ms        | 30 ms         |
| 100 Kset/s          | OK     | 70M items        | 1.5 ms      | 5 ms        | 10 ms       | 40 ms         |
Benchmarks
Heavy Writes / Light Reads (reads fixed at 10 Kqps), with one replica

| Write rate per node | Status    | Disk write queue | Latency p50 | Latency p95 | Latency p99 | Latency p99.9 |
|---------------------|-----------|------------------|-------------|-------------|-------------|---------------|
| 20 Kset/s           | OK        | 12M items        | 0.4 ms      | 1 ms        | 2 ms        | 10 ms         |
| 30 Kset/s           | OK        | 33M items        | 0.5 ms      | 2 ms        | 4 ms        | 20 ms         |
| 40 Kset/s           | OK        | 60M items        | 0.6 ms      | 2 ms        | 5 ms        | 25 ms         |
| 50 Kset/s           | NOK (OOM) | >70M items       | 0.7 ms      | 5 ms        | 50 ms       | 75 ms         |
Benchmarks
Heavy Reads / Light Writes (writes fixed at 10 Kqps), with one replica

| Read rate per node | Status | Disk write queue  | Latency p50 | Latency p95 | Latency p99 | Latency p99.9 |
|--------------------|--------|-------------------|-------------|-------------|-------------|---------------|
| 25 Kget/s          | OK     | 130k items        | 0.4 ms      | 0.7 ms      | 4 ms        | 8 ms          |
| 50 Kget/s          | OK     | 130k items        | 0.4 ms      | 1 ms        | 5 ms        | 10 ms         |
| 75 Kget/s          | OK     | 130k items        | 0.4 ms      | 5 ms        | 15 ms       | 25 ms         |
| 100 Kget/s         | NOK    | 50k to 500k items | 16 ms       | 25 ms       | 45 ms       | 100 ms        |
Benchmarks
Conclusion for a single node:
● The 1 Gb network is the bottleneck
● Replicas introduce latency
● Reads are fast
● Max write rate with replica: 40 Kqps per node
● Max read rate with replica: 90 Kqps per node
● Max read/write rate without replica: 90 Kqps per node
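As a rough sanity check on the network bottleneck: ~90 Kqps of ~1 kB values is about 90 MB/s, i.e. ~720 Mb/s of payload before protocol, TCP/IP, and replication overhead, which is enough to saturate a 1 Gb/s interface.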
2. SLI, SLO & SLA
Metrics
Metrics are great!
● Total QPS (read + write)
● Total RAM usage
● Availability
● Number of items
● …
But these alone are not enough to know the global service status!
SLI: add the major missing metric
Adding latency monitoring as an SLI, to be part of our Couchbase SLOs and SLAs.
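As a starting point, one can sample the per-operation latency histograms Couchbase servers already expose; this is a generic sketch, not Criteo's actual tooling (node and bucket are placeholders):

cbstats <node>:11210 timings -b <bucket>   # per-operation latency histograms (get, set, ...)

Server-side timings exclude network and client time, so a client-side probe measuring full round-trip percentiles (p95/p99/p99.9) against each cluster is the better SLI; the histograms above remain a useful cross-check.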
3. Couchbase support
Support contract
● Get the latest Couchbase bug fixes
● Suggest Couchbase enhancements
● Speed up incident resolution with the help of support
● Get better Couchbase performance tuning recommendations
4. Refactoring infrastructure
Split usages
● High-load (QPS) buckets are on dedicated clusters
● Low-load (QPS) buckets are grouped on separate "shared" clusters
● Persisted and non-persisted buckets are no longer on the same servers
5. Administration Automation
Automation: why?
● Need to upgrade from the Community to the Enterprise edition
● Need to apply new configuration options that require a restart of all the nodes in a cluster
● Need to apply fixes that require a reboot of all the nodes in a cluster
● Need to reinstall servers from scratch
Automation: how?
● Criteo uses Chef to bootstrap servers and deploy applications and configuration
● We did not want to add yet another tool to the loop
● Nothing with the required features already existed
● We developed a FOSS Chef cookbook for this and other use cases: Choregraphie
https://github.com/criteo-cookbooks/choregraphie
Automation: Choregraphie
With Choregraphie we can perform:
● Rolling restart with rebalance
● Rolling upgrade with rebalance
● Rolling reboot with rebalance
● Rolling reinstall with rebalance
● Any of the above with an optional additional server to speed up the rebalance
Choregraphie is open source! Feel free to contribute.
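For one node at a time, such a rolling operation boils down to roughly the sequence below. This is a hand-written couchbase-cli sketch of the steps the cookbook automates, not Choregraphie's own code; exact flags vary across Couchbase versions, and cluster/node addresses are placeholders:

couchbase-cli failover -c <cluster>:8091 -u <Administrator> -p <pwd> --server-failover=<node>:8091   # graceful failover
service couchbase-server restart   # run on <node>: the restart, upgrade, or reboot happens here
couchbase-cli recovery -c <cluster>:8091 -u <Administrator> -p <pwd> --server-recovery=<node>:8091 --recovery-type=delta
couchbase-cli rebalance -c <cluster>:8091 -u <Administrator> -p <pwd>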
6. Couchbase and system tuning
Couchbase best practices / system tuning
● Minimize swap usage:
○ vm.swappiness = 0 (set to 1 for kernels > 3.5)
● Disable Transparent Huge Pages:
○ chkconfig disable-thp on
● Set the SSD I/O scheduler to deadline:
○ echo "deadline" > /sys/block/sdX/queue/scheduler
● Change the CPUFreq governor:
○ modprobe cpufreq_performance
● Raise the maximum number of connections:
○ max_conns_on_port_XXXX: 30000
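Taken together, a minimal boot-time script applying these settings might look like this (a sketch assuming recent-kernel sysfs paths; device names are placeholders):

sysctl -w vm.swappiness=0                                   # use 1 on kernels > 3.5
echo never > /sys/kernel/mm/transparent_hugepage/enabled    # disable THP
echo never > /sys/kernel/mm/transparent_hugepage/defrag
echo deadline > /sys/block/sdX/queue/scheduler              # repeat for each SSD
for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
  echo performance > "$g"                                   # CPUFreq performance governor
done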
Couchbase tuning
● Raise the max_num_nonio ("Nonio") parameter to 8 to avoid rebalance failures on highly loaded clusters:
○ curl -i -u <Administrator>:<pwd> --data 'ns_bucket:update_bucket_props("<bucketname>", [{extra_config_string, "max_num_nonio=<N>"}]).' http://<NodeIP>:8091/diag/eval
● Disable the access log if you don't need it, to reduce disk usage (native in Couchbase 4.5):
○ curl -i -u <Administrator>:<pwd> --data 'ns_bucket:update_bucket_props("<bucketname>", [{extra_config_string, "access_scanner_enabled=false"}]).' http://<NodeIP>:8091/diag/eval
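For instance, applying the value of 8 mentioned above to a hypothetical bucket named profiles:

curl -i -u <Administrator>:<pwd> --data 'ns_bucket:update_bucket_props("profiles", [{extra_config_string, "max_num_nonio=8"}]).' http://<NodeIP>:8091/diag/eval

Keep in mind that /diag/eval is an internal endpoint; it is worth validating such changes with Couchbase support (see section 3) before rolling them out.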
Tuning: what's next?
● Network teaming with 802.3ad (bonding) over 2x 1 Gb cards
● 10 Gb network cards
● Upgrade to Couchbase 4.5
● Upgrade to a newer vanilla LTS kernel to enable SSD-specific enhancements (multi-queue SSD support)
● Switch to Mesos to reduce administration time
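For reference, an 802.3ad bond on a RHEL/CentOS-style system looks roughly like the sketch below (interface names and addresses are made up, and the switch ports must be configured for LACP as well):

# /etc/sysconfig/network-scripts/ifcfg-bond0
DEVICE=bond0
TYPE=Bond
BONDING_OPTS="mode=802.3ad miimon=100 xmit_hash_policy=layer3+4"
BOOTPROTO=none
IPADDR=10.0.0.10
NETMASK=255.255.255.0
ONBOOT=yes

# /etc/sysconfig/network-scripts/ifcfg-eth0 (ifcfg-eth1 is identical apart from DEVICE)
DEVICE=eth0
MASTER=bond0
SLAVE=yes
BOOTPROTO=none
ONBOOT=yes

With layer3+4 hashing, flows spread across both links, but any single TCP connection stays capped at 1 Gb/s, which is why 10 Gb cards are on the list too.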
Questions?
Criteo - Couchbase Live 2016 - Paris
Pierre Mavro / @deimosfr
