APACHE BOOKKEEPER KV STORE AND USE CASES
SHIVJI KUMAR JHA
@ShivjiJha
TRACK: MESSAGING
in/shivjijha
About Me
• Senior MTS at Nutanix
• Platform Engineer
– DBs, SOA, Infra, Streams
• Love
– Distributed data systems
– Open-source software (OSS)
• OSS Contributions
– Apache Pulsar
– MySQL
Contents
Why KV store?
What is bookkeeper?
How to use bookkeeper?
History of Data Stores
A Brief History…
Of Databases
• 1960s: Flat Files
• 1960s: Hierarchical Databases
• 1980s: SQL / Relational Databases
– High-level language
– Abstractions: Schema, Transactions, Indexes
• 2004: NoSQL
– Scale & Availability above all
– No relational model
• 2010s: Distributed SQL
Image source: https://commons.wikimedia.org/wiki/File:Human_evolution.svg
A Brief History…
Of Data Streams
• Apache Kafka:
– Built inside LinkedIn
– 2011: Kafka becomes open source
– 2012: Graduated from Apache incubator
• Apache Pulsar
– Built at Yahoo
– 2016: Contributed to Open source
– 2018: Top-level Apache project
A Brief History…
Of Apache Bookkeeper
• Born at Yahoo! Research
• Evolved from Apache Zookeeper (ZK)
• 2011: Incubated as subproject under ZK
• 2015: Top level Apache Project
Apache Bookkeeper
What is Bookkeeper?
• Infinite Stream of log records
• Horizontally scalable storage
• Fault-tolerant
• Low latency writes
• Offers
– Durability
– Tunable replication
– Strong consistency
Use cases
• As write-ahead log (WAL) in
– HDFS namenode (first use case)
– Twitter’s Manhattan : distributed KV
– HerdDB : JVM-embeddable distributed database
• Apache Pulsar : Message & Offset store
• Salesforce : Internal database for application storage
• Pravega (DellEMC) : Message store
• Bytedance : Internal metadata store
B-tree vs LSM
• Primary data structures for storage engines.
• B-trees behind traditional databases
– MySQL, PostgreSQL
– Indexing for expensive random access on HDD
• Log-structured merge (LSM) trees
– Good write throughput
– Behind a variety of modern workloads
• Stream : Apache Bookkeeper, Kafka Streams, Apache Pulsar, Flink
• OLTP : MyRocks, MongoRocks, Rocksandra, YugaByte, CockroachDB
• TSDB : InfluxDB
– Take advantage of SSD throughput
Key Value stores
• KV stores as common core behind:
– Key Value databases
– Relational databases
• Key : Primary Key, Value: Complete row
– Document databases
• Key : Primary Key (internal?), Value: document
– Streaming Platforms
• RocksDB-based : Apache Pulsar, Kafka Streams, Flink
• Good idea to have fewer clusters!
• Good idea to have the same base (KV) across clusters!
Bookkeeper = ZK + RocksDB
RocksDB
• Implements LSM
• Embeddable
• Key Value store
• Append only
– Low latency
– High throughput
• Updates / deletes are written as new (duplicate) records
• Compaction removes stale / deleted records
Zookeeper
• Metadata store
• Cluster coordination
• Service discovery
• Leader election
• Dynamic configurations
• Feature flags
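The RocksDB points above (append only, a new record for every update or delete, compaction to drop stale records) can be illustrated with a toy log. This is only a sketch of the idea, not RocksDB's API or on-disk format; the class name `AppendOnlyKV` and the `TOMBSTONE` marker are made up for this example.

```python
# Toy append-only key-value log (illustration only; not RocksDB's actual API).
TOMBSTONE = object()  # marker record meaning "this key was deleted"


class AppendOnlyKV:
    def __init__(self):
        self.log = []  # list of (key, value) records, oldest first

    def put(self, key, value):
        # An update never overwrites: it appends a newer record for the key.
        self.log.append((key, value))

    def delete(self, key):
        # A delete is also an append, of a tombstone record.
        self.log.append((key, TOMBSTONE))

    def get(self, key):
        # The newest record for a key wins, so scan from the tail.
        for k, v in reversed(self.log):
            if k == key:
                return None if v is TOMBSTONE else v
        return None

    def compact(self):
        # Keep only the newest record per key and drop tombstones.
        latest = {}
        for k, v in self.log:
            latest[k] = v
        self.log = [(k, v) for k, v in latest.items() if v is not TOMBSTONE]


kv = AppendOnlyKV()
kv.put("a", 1)
kv.put("a", 2)      # the stale copy of "a" stays in the log
kv.put("b", 3)
kv.delete("b")      # a tombstone is appended for "b"
size_before = len(kv.log)
kv.compact()
size_after = len(kv.log)
```

Writes stay cheap because they only append; the cost of cleaning up is paid later, in bulk, by compaction.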
Bookkeeper Internals
Bookkeeper Cluster : Replication
https://medium.com/streamnative/why-apache-bookkeeper-part-1-consistency-durability-availability-ac697a3cf7a1
Bookkeeper : Typical Usage
https://medium.com/streamnative/why-apache-bookkeeper-part-1-consistency-durability-availability-ac697a3cf7a1
Bookkeeper Glossary
Entries
Actual data (bytes) written to ledgers, plus metadata.
Entry: [ledgerId, entryId, Checksum…]
Entry Log File
Actual physical file with entries.
Offsets indexed for fast lookup.
Asynchronous garbage collection of deleted and stale entries.
Bookkeeper Glossary
Journal
Transaction log (write-ahead log)
Append-only semantics
Low latency, high throughput writes
Can be turned on / off (durability vs throughput)
Ledger
Logical unit of storage for APIs in Bookkeeper.
Append-only semantics
Indexed & cached for faster lookups
Includes: [Status, lastEntryId, [entries], replication factors…]
Bookkeeper : Client & Server
Client Based Replication
• Bookkeeper has no leader / follower.
• Same responsibility across nodes.
• Thick Bookkeeper client implements replication, coordination, and consistency.
• Separate auto-detection and restore module for lost entries.
Bookkeeper APIs
• Create ledger (sync / async)
• Append entry to ledger
• Read entry from ledger
• Delete ledger (sync / async)
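The four ledger operations listed on this slide can be sketched with an in-memory stand-in. This is not the real Bookkeeper client; `ToyLedgerStore` and its method names are hypothetical, and replication, persistence, and the async variants are left out.

```python
import itertools


class ToyLedgerStore:
    """In-memory stand-in for the four ledger operations (hypothetical names)."""

    def __init__(self):
        self._ledgers = {}                 # ledger_id -> list of entry bytes
        self._next_id = itertools.count()  # ledger ids handed out in order

    def create_ledger(self):
        ledger_id = next(self._next_id)
        self._ledgers[ledger_id] = []
        return ledger_id

    def append_entry(self, ledger_id, data):
        entries = self._ledgers[ledger_id]
        entries.append(bytes(data))
        return len(entries) - 1            # entry id = position in the ledger

    def read_entry(self, ledger_id, entry_id):
        return self._ledgers[ledger_id][entry_id]

    def delete_ledger(self, ledger_id):
        del self._ledgers[ledger_id]


store = ToyLedgerStore()
lid = store.create_ledger()
first = store.append_entry(lid, b"hello")
second = store.append_entry(lid, b"world")
payload = store.read_entry(lid, second)
store.delete_ledger(lid)
```

Note that there is no update call: a ledger only grows by appends, and space is reclaimed by deleting the whole ledger.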
Bookkeeper Server : Write Path
[Diagram: the Bookkeeper Client sends Writes through the LEDGER APIs to the Bookkeeper Server, which records them in the Journal (WAL).]
Bookkeeper Server : Append only
[Diagram: Writes from the Bookkeeper Client arrive through the LEDGER APIs and are appended to the Journal (WAL).]
Bookkeeper Server : Write Path
[Diagram: on the Bookkeeper Server, Writes go to the Journal (WAL) and to the Write Cache.]
Bookkeeper Server : Read-Write
[Diagram: Writes go through the LEDGER APIs to the Journal (WAL) and Write Cache; Reads go through the LEDGER APIs to the Read Cache and Entry Log Files.]
Bookkeeper Server : IO isolation
[Diagram: the write side (Journal (WAL), Write Cache) and the read side (Read Cache, Entry Log Files) are shown on separate disks.]
Bookkeeper Server : Read Path
[Diagram: Reads arrive through the LEDGER APIs and are served from the Write Cache, the Read Cache, or the Entry Log Files located via the index.]
Bookkeeper Server : Flush
[Diagram: the Write Cache is flushed to the Entry Log Files.]
Asynchronous, batched flush!
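The server-side slides above (journal, write cache, batched flush to entry log files, indexed reads) can be pulled together in a small sketch. It is a simplification under stated assumptions: `ToyBookie`, `flush_threshold`, and the method names are invented here, the flush runs inline rather than asynchronously, and the read cache is omitted.

```python
class ToyBookie:
    """Sketch of one bookie's write and read path (names are illustrative)."""

    def __init__(self, flush_threshold=3):
        self.journal = []        # append-only WAL, one record per add
        self.write_cache = {}    # (ledger_id, entry_id) -> data, not yet flushed
        self.entry_log = []      # stands in for entry log files on disk
        self.index = {}          # (ledger_id, entry_id) -> offset in entry_log
        self.flush_threshold = flush_threshold

    def add_entry(self, ledger_id, entry_id, data):
        self.journal.append((ledger_id, entry_id, data))   # durability record
        self.write_cache[(ledger_id, entry_id)] = data     # held until flushed
        if len(self.write_cache) >= self.flush_threshold:
            self.flush()                                   # batched flush
        return True                                        # ack to the client

    def flush(self):
        # Move every cached entry to the entry log in one batch and index it.
        for key in sorted(self.write_cache):
            self.index[key] = len(self.entry_log)
            self.entry_log.append(self.write_cache[key])
        self.write_cache.clear()

    def read_entry(self, ledger_id, entry_id):
        key = (ledger_id, entry_id)
        if key in self.write_cache:              # recent writes come from memory
            return self.write_cache[key]
        return self.entry_log[self.index[key]]   # otherwise index -> entry log


bookie = ToyBookie(flush_threshold=3)
bookie.add_entry(1, 0, b"a")
bookie.add_entry(1, 1, b"b")
cached_before_flush = len(bookie.write_cache)
bookie.add_entry(1, 2, b"c")                     # third add triggers the flush
```

Batching the flush is what lets entry log writes be large and sequential, while the journal alone covers durability for entries still sitting in the cache.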
Bookkeeper : Offsets
Last add confirmed (LAC)
• Sent in response to write()
• Cumulative ack
• Readers can read until LAC
Last add pushed (LAP)
• Last entry client requested to write.
• Write in progress, not acked yet.
[Diagram: Entries in a ledger, with READERS up to the LAC and the WRITER at the LAP.]
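The relationship between the two offsets can be sketched as a small writer-side tracker. This is an illustration of the cumulative-ack idea only; `ToyLedgerWriter` and its methods are hypothetical names, and acks are modelled as a single call per entry rather than responses from several bookies.

```python
class ToyLedgerWriter:
    """Tracks LAP and LAC for one ledger (illustrative names)."""

    def __init__(self):
        self.lap = -1          # last add pushed: highest entry id sent
        self.lac = -1          # last add confirmed: highest contiguous acked id
        self._acked = set()    # acked ids that are still above the LAC

    def push(self):
        self.lap += 1
        return self.lap

    def ack(self, entry_id):
        self._acked.add(entry_id)
        # Cumulative ack: the LAC only moves while there is no gap below it.
        while self.lac + 1 in self._acked:
            self.lac += 1
            self._acked.remove(self.lac)

    def readable(self):
        # Readers may only see entries up to and including the LAC.
        return list(range(self.lac + 1))


w = ToyLedgerWriter()
ids = [w.push() for _ in range(3)]   # entries 0, 1, 2 are in flight
w.ack(1)                             # acked out of order: LAC cannot move yet
lac_after_gap = w.lac
w.ack(0)                             # gap closed: LAC advances past 0 and 1
```

Entry 2 is pushed but not confirmed, so it sits between the LAC and the LAP and stays invisible to readers.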
Bookkeeper : Recovery
[Diagram: Entries in a ledger, with READERS up to the LAC and the WRITER at the LAP.]
Bookkeeper : Recovery
Bookkeeper Failure
• Writer crashed / network partition
• Client retries / fails
• Retry reaches new bookkeeper node
New Bookkeeper owner
• Put ledger state in recovery
• Fences old file with consensus.
• Write to new file
• New owner back? Split brain?
[Diagram: Entries with READERS, LAC, LAP, the old WRITER, and a NEW WRITER.]
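Fencing, as listed under the new owner above, can be sketched as a flag that makes every copy of the ledger reject further appends. This is a heavily simplified illustration: the names `ToyBookieLedger`, `LedgerFencedError`, and `recover` are invented, all copies are fenced rather than a quorum, and taking the highest entry id seen is only a rough stand-in for deciding where the ledger ends.

```python
class LedgerFencedError(Exception):
    """Raised when a write reaches a ledger copy that has been fenced."""


class ToyBookieLedger:
    """One bookie's copy of a ledger (illustrative names)."""

    def __init__(self):
        self.entries = {}
        self.fenced = False

    def add(self, entry_id, data):
        if self.fenced:
            raise LedgerFencedError("ledger is fenced")
        self.entries[entry_id] = data

    def fence(self):
        self.fenced = True
        return max(self.entries, default=-1)   # highest entry id seen here


def recover(bookies):
    # The new owner fences every copy and notes the highest entry id seen.
    return max(b.fence() for b in bookies)


bookies = [ToyBookieLedger() for _ in range(3)]
for b in bookies:
    b.add(0, b"x")
bookies[0].add(1, b"y")          # old writer got entry 1 to one bookie only
last_seen = recover(bookies)     # new owner takes over
try:
    bookies[1].add(1, b"y")      # old writer comes back and retries
    old_writer_blocked = False
except LedgerFencedError:
    old_writer_blocked = True
```

Because the returning writer is rejected, it cannot keep appending to the old ledger behind the new owner's back, which is the split-brain case the slide asks about.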
Bookkeeper: A Pulsar Use Case
Apache Pulsar 101
Apache Pulsar
• Cloud-native,
• Distributed messaging and
• Distributed streaming platform
Highlights
• Modular Design
• Horizontally scalable
• Low latency & high throughput
• Multi-tenancy
• Geo Replication
Apache Pulsar 101
[Diagram: PRODUCER and CONSUMER talk to the BROKER, which uses BOOKKEEPER and ZOOKEEPER.]
Bookkeeper Server : Read-Write
[Diagram: Writes go through the LEDGER APIs to the Journal (WAL) and Write Cache; Reads go through the LEDGER APIs to the Read Cache and Entry Log Files.]
Pulsar Broker & Bookkeeper
[Diagram: PRODUCER and CONSUMER connect to the Pulsar BROKER, which serves TOPIC1, TOPIC2, TOPIC3. The broker contains the BOOKKEEPER CLIENT and uses the LEDGER APIs for Writes and Reads against the Bookkeeper Server (Journal (WAL), Write Cache, Read Cache, Entry Log Files).]
Pulsar Broker & Bookkeeper : Topic Ledger Mapping
[Diagram: PRODUCER and CONSUMER connect to the Pulsar BROKER, which serves TOPIC1, TOPIC2, TOPIC3 and reaches Bookkeeper through the BOOKKEEPER CLIENT.]
• TOPIC 3 maps to a MANAGED LEDGER holding
– Ledgers[]
– schemaLedgers[]
– compactedLedgers[]
• Per ledger: ledgerId, entriesRange, ledger size, metadata, offloaded?
• CURSOR 1, CURSOR 2 : one per consumer
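The topic-to-ledger mapping above can be sketched as a list of ledgers plus independent cursors. This is an assumption-laden toy, not Pulsar's managed ledger implementation: `ToyManagedLedger`, `max_entries_per_ledger`, and the cursor names are made up, and schema or compacted ledgers and offloading are ignored.

```python
class ToyManagedLedger:
    """A topic's log as a list of ledgers plus named cursors (illustrative)."""

    def __init__(self, max_entries_per_ledger=2):
        self.max_entries = max_entries_per_ledger
        self.ledgers = [[]]    # each ledger is a list of messages
        self.cursors = {}      # cursor name -> (ledger index, entry index)

    def publish(self, message):
        if len(self.ledgers[-1]) >= self.max_entries:
            self.ledgers.append([])          # roll over to a new ledger
        self.ledgers[-1].append(message)

    def open_cursor(self, name):
        self.cursors[name] = (0, 0)

    def read_next(self, name):
        ledger_idx, entry_idx = self.cursors[name]
        # Skip past ledgers the cursor has fully consumed.
        while entry_idx >= len(self.ledgers[ledger_idx]):
            if ledger_idx == len(self.ledgers) - 1:
                return None                  # caught up with the producer
            ledger_idx, entry_idx = ledger_idx + 1, 0
        message = self.ledgers[ledger_idx][entry_idx]
        self.cursors[name] = (ledger_idx, entry_idx + 1)
        return message


topic = ToyManagedLedger(max_entries_per_ledger=2)
for m in ["m1", "m2", "m3"]:
    topic.publish(m)
topic.open_cursor("cursor-1")
topic.open_cursor("cursor-2")
read_by_c1 = [topic.read_next("cursor-1") for _ in range(4)]
read_by_c2 = topic.read_next("cursor-2")
```

Each cursor keeps its own position, so one consumer falling behind does not affect another, and the topic appears as one continuous stream even though it spans several ledgers.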
Cluster Coordination: Zookeeper
• Pointers to data
– Topic ledgers mapping
– Ledger topics mapping
– Topic schema mapping
• Service Discovery
– List of available bookies
– List of available brokers
– Which broker owns which topic
– Load on each topic, etc.
• Distributed coordination
– Locks
– Leader election
• System Configuration
– Dynamic configs for hot reload
– Feature flags
• Provisioning Configuration
– Metadata for tenants, namespaces
– Namespace policies
Summary
• Plethora of databases, workloads, use cases.
– Too many clusters – difficult to operate
• RocksDB : very popular LSM implementation
– High write throughput, leverages SSD throughput
– Varied workloads on RocksDB : databases, queues, streams
• Bookkeeper : Consistent distributed KV base
– Infinite commit log
– Can be used in many different ways
– Apache Pulsar is one example, with many more being built!
– Fault-tolerant, horizontally scalable store behind Pulsar
References
1. Choosing between Efficiency and Performance with RocksDB, by Mark Callaghan
2. FoundationDB Record Layer (white paper)
3. Why Apache Bookkeeper, part 1: consistency, durability, availability, by Sijie Guo
4. Understanding How Apache Pulsar Works, by Jack Vanlightly
5. How Pulsar Stores Your Data, Pulsar Summit NA 2021, by Shivji Kumar Jha
6. Convergence of Messaging, Streaming and Storage, by Sijie Guo
THANK YOU
QUESTIONS?
@ShivjiJha
shiv4289
in/shivjijha/
ShivjiKumarJha

ApacheCon 2021: Apache Bookkeeper Key Value Store and Use Cases
