The evolution of the Venice ingestion pipeline
Venice is an open-source derived data storage platform and the default storage solution for online AI use cases at LinkedIn. It powers a wide range of critical LinkedIn products, including People You May Know (PYMK), feed, videos, ads, notifications platform, A/B testing platform (Lix), LinkedIn Learning, and more.
Since its launch at LinkedIn in 2016, Venice has grown exponentially – from a handful of data stores to over 2,600 production stores internally. Its use cases have evolved from simple bulk-loading workflows to highly complex active-active replication with partial update capabilities.
In this post, we’ll dive into how we revamped the Venice ingestion pipeline to achieve over 230 million records per second in production. We’ll walk through its architectural evolution, the key optimizations that made this possible, and how we leveraged advanced features from our dependencies. We’ll also share our strategies for tuning various workload types and outline our plans for future enhancements. Many of the optimizations we adopted are broadly applicable, and we hope readers can adapt these techniques to improve efficiency and performance in their own distributed systems.
Key use cases & optimizations
The diagram above provides a high-level overview of the Venice ingestion pipeline (we recommend reading our most recent Venice blog post here to become familiar with the key terminology referenced throughout this post). Venice store owners can write to their stores via bulk load jobs on an offline processing platform, nearline writes from a stream processing platform, or direct writes from online applications. All writes pass through an intermediate PubSub broker layer, from which the Venice Storage Node consumes and persists the data locally using RocksDB, an embedded key-value store.
In the following section, we’ll outline the challenges encountered while supporting diverse workloads and delve into each component to explain the optimizations implemented so far.
Use Case 1: bootstrapping from an offline dataset
Venice users can run bulk load jobs using offline processing platforms such as Spark to push new data versions to their Venice stores. However, a key challenge in this process is optimizing performance for large or massive stores. To identify bottlenecks, it's important to understand the ingestion pipeline in detail.
The Venice Push Job (VPJ) creates a new version topic, with multiple partitions, corresponding to the new store version. The Spark job leverages a map-reduce framework to produce data messages to the version topic, maintaining one reducer per topic partition to preserve message ordering.
On the consumption side, the Venice Storage Node (VSN) spins up consumers to read these messages and persists them locally using RocksDB, with one RocksDB instance per topic partition.
Based on operational experience in production, bottlenecks can arise at any of the following stages: producing, consuming, and persisting. Here are some of the strategies we’ve adopted to address these challenges:
- Improving producing/consuming throughput
To increase throughput during data production and consumption, we typically increase the number of partitions for large stores to better utilize the capacity of the PubSub cluster. However, a higher partition count comes at a cost: more partitions introduce management overhead across both the Venice and PubSub platforms, and there is a throughput ceiling per PubSub broker.
- Enhancing consumption scalability
To ensure the Venice Storage Node can keep up with data production, we use shared consumer pools across all hosted stores. Each store version can leverage multiple consumers by distributing its hosted partitions among them, keeping multiple connections open per PubSub broker to speed up consumption (similar in spirit to a download manager). At the same time, the pool approach caps the total number of consumers to limit the overall cost.
- Optimizing I/O performance
To fully utilize local SSD I/O capacity, the VSN uses a shared writer pool to persist changes concurrently across multiple RocksDB instances. Since ordering is critical in Venice, only one writer actively writes to a given RocksDB instance at any time.
- Minimizing memory overhead
Since messages for a given partition are strictly ordered (thanks to the map-reduce framework), we use RocksDB’s SstFileWriter to generate SST files directly, significantly reducing memory overhead during ingestion, as sketched below.
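A minimal sketch of this flow using RocksDB’s Java API: pre-sorted key-value pairs are written into an SST file via SstFileWriter, and the file is then ingested into the database directly, bypassing the memtable. The paths, keys, and values are placeholders; Venice’s actual integration adds more bookkeeping around this flow.

```java
import org.rocksdb.*;

import java.nio.charset.StandardCharsets;
import java.util.Collections;

public class SstIngestSketch {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (Options options = new Options().setCreateIfMissing(true);
         EnvOptions envOptions = new EnvOptions();
         SstFileWriter writer = new SstFileWriter(envOptions, options)) {

      // Write keys in sorted order; SstFileWriter requires strictly increasing keys,
      // which the map-reduce shuffle already guarantees per partition.
      writer.open("/tmp/partition_0.sst");
      for (int i = 0; i < 1000; i++) {
        String key = String.format("key_%06d", i);
        writer.put(key.getBytes(StandardCharsets.UTF_8),
                   ("value_" + i).getBytes(StandardCharsets.UTF_8));
      }
      writer.finish();

      // Ingest the generated SST file directly, bypassing the memtable and WAL.
      try (RocksDB db = RocksDB.open(options, "/tmp/venice_partition_0");
           IngestExternalFileOptions ingestOptions = new IngestExternalFileOptions()) {
        db.ingestExternalFile(Collections.singletonList("/tmp/partition_0.sst"), ingestOptions);
      }
    }
  }
}
```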
The diagram below illustrates the optimized ingestion pipeline incorporating the strategies outlined above.
Use Case 2: hybrid store
Venice is a derived data platform that supports Lambda architecture use cases, enabling it to seamlessly merge updates from both bulk loads and nearline writes. This allows Venice users to access a unified, up-to-date view of their data by querying a single store.
As shown in the diagram above, each bulk load generates a new store version, which includes a corresponding new Kafka topic and database instance. Real-time updates produced by the Samza job via the real-time topic are appended to both version topics to keep them current. Once the new version has fully caught up with all updates, it is swapped in as the active version to handle read traffic.
The Venice hybrid store plays a crucial role within the Venice ecosystem, but it also brought new challenges. While bulk load and nearline updates share similar characteristics in the initial stages of the ingestion pipeline, the key distinction is that the Venice hybrid store transitions the database from a read-only mode to read-write.
In RocksDB, when incoming data contains duplicates, meaning some keys are updated or deleted after their initial insertion, compaction is used to remove stale entries. However, compaction incurs overhead, since it must periodically scan, merge, and rewrite SST files to disk, consuming CPU, I/O, and disk resources. Therefore, the key optimization for the Venice hybrid store focuses on tuning RocksDB to balance read amplification, write amplification, and space amplification effectively:
- “Write amplification” is the ratio of bytes written to storage versus bytes written to the database.
- “Read amplification” is the number of disk reads per query.
- “Space amplification” is the ratio of the size of database files on disk to data size.
To balance these trade-offs, Venice uses RocksDB’s default leveled compaction strategy and primarily relies on two methods to balance the various amplification factors.
1. Tuning the compaction trigger:
level0_file_num_compaction_trigger. This configuration controls how many files are allowed to accumulate in Level-0 of RocksDB. When the number of Level-0 files reaches this threshold, compaction is triggered to merge the SST files from Level-0 into Level-1, and into deeper levels as each level fills up.
A higher threshold means more files accumulate in Level-0, resulting in lower write amplification due to fewer compactions. However, this also increases read amplification because reads may need to scan multiple Level-0 files, especially when Bloom filters can’t conclusively determine if a file contains the requested key. Additionally, space amplification rises since more duplicate entries remain due to less frequent compaction.
In practice, we tune this threshold on a per-cluster basis, as different clusters experience different bottlenecks. For example, memory-serving clusters aim to keep all data in RAM to speed up lookups; because memory is the limiting resource, we set a lower threshold to reduce space amplification. Disk-serving clusters, on the other hand, are typically limited by disk I/O, so we use a higher threshold to reduce the frequency of compactions and lower the disk write rate.
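As a minimal sketch of how this knob is applied (the specific value and path are illustrative, not our production settings), the threshold maps directly to a RocksDB option set when each partition’s database is opened:

```java
import org.rocksdb.CompactionStyle;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class CompactionTriggerSketch {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    // A lower trigger compacts eagerly (less space amplification, more disk writes);
    // a higher trigger batches more work per compaction (less write amplification).
    try (Options options = new Options()
            .setCreateIfMissing(true)
            .setCompactionStyle(CompactionStyle.LEVEL)     // Leveled compaction, as used by Venice.
            .setLevel0FileNumCompactionTrigger(4);         // Illustrative value, tuned per cluster.
         RocksDB db = RocksDB.open(options, "/tmp/venice_partition_db")) {
      // Ingestion writes flow into this instance; compaction starts once Level-0
      // accumulates the configured number of SST files.
    }
  }
}
```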
2. RocksDB BlobDB integration
The following is quoted from the official RocksDB documentation: “BlobDB is essentially RocksDB for large-value use cases. The basic idea, which was proposed in the WiscKey paper, is key-value separation: by storing large values in dedicated blob files and storing only small pointers to them in the LSM tree, we avoid copying the values over and over again during compaction. This reduces write amplification, which has several potential benefits like improved SSD lifetime, and better write and read performance. On the other hand, this comes with the cost of some space amplification due to the presence of blobs that are no longer referenced by the LSM tree, which have to be garbage collected.”
Integrating BlobDB with Venice has significantly reduced write amplification in Venice’s multi-tenant clusters, especially for large-value use cases where write amplification had become the primary bottleneck; we observed a more than 50% reduction in disk write throughput. Previously, we had to scale out these clusters despite having available CPU and storage space.
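For reference, key-value separation is enabled per database through RocksDB options; recent RocksDB Java releases expose them as shown in the sketch below, where the size threshold and path are illustrative rather than Venice’s actual settings:

```java
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class BlobDbSketch {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    try (Options options = new Options()
            .setCreateIfMissing(true)
            .setEnableBlobFiles(true)               // Store large values in dedicated blob files.
            .setMinBlobSize(4 * 1024)               // Illustrative threshold: separate values >= 4 KB.
            .setEnableBlobGarbageCollection(true);  // Reclaim space held by unreferenced blobs.
         RocksDB db = RocksDB.open(options, "/tmp/venice_large_value_db")) {
      // Compaction now rewrites only the small pointers in the LSM tree, while the
      // large values are written once to blob files.
    }
  }
}
```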
Use Case 3: active/active replication with partial update
Venice, as a derived data platform, guarantees eventual consistency, which differs from traditional key-value stores that typically provide strong consistency. This means Venice users cannot perform read-modify-write operations directly due to inherent write delays. To address this, Venice introduces a specialized operation called partial update, enabling users to perform field-level updates and collection merges efficiently.
Within the Venice server, when handling partial updates, the leader replica executes a series of operations to ensure that all replicas maintain the same data copy, preserving eventual consistency across the whole cluster.
The diagram above illustrates how the leader replica processes real-time partial updates and replicates the merged results to follower replicas, ensuring eventual consistency.
Handling partial update workloads requires significantly more effort from the Venice leader replica. It must decode the incoming payload, apply the update, re-encode the data, and then write it to both the local database and the Version Topic. Most of these operations are CPU-intensive.
Later, we introduced a new workload type called active/active replication to achieve eventual consistency across multiple data centers. The key technology behind this is deterministic conflict resolution (DCR), which is similar to CRDTs. Essentially, Venice tracks update timestamps at the row and field levels, comparing incoming update timestamps with existing ones to determine whether an update should be applied or skipped.
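To make the idea concrete, here is a simplified, row-level sketch of the comparison. The types and names are hypothetical, and the real Venice implementation also merges at the field level and applies more nuanced tie-breaking rules:

```java
public final class DcrSketch {
  /** Minimal stand-in for the per-row replication metadata tracked alongside the value. */
  record StoredValue(byte[] payload, long timestamp, int sourceColoId) {}

  record IncomingWrite(byte[] payload, long timestamp, int sourceColoId) {}

  /**
   * Decide whether an incoming cross-colo write should replace the stored value.
   * The higher timestamp wins; on a tie, a deterministic tie-breaker (here: colo id)
   * ensures every replica converges to the same result.
   */
  static boolean shouldApply(IncomingWrite incoming, StoredValue stored) {
    if (stored == null) {
      return true; // Nothing stored yet, so the write always applies.
    }
    if (incoming.timestamp() != stored.timestamp()) {
      return incoming.timestamp() > stored.timestamp();
    }
    return incoming.sourceColoId() > stored.sourceColoId();
  }
}
```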
The architecture has evolved to support active/active replication, requiring the Venice leader replica to perform additional work for DCR. This involves timestamp metadata lookup, decoding, and encoding. To improve CPU efficiency, we have implemented the following optimizations:
1. Fast-Avro adoption
Fast-Avro was originally developed by RTBHouse, but as LinkedIn required several improvements, we took over its maintenance under the LinkedIn namespace and introduced numerous optimizations. At a high level, Fast-Avro offers an alternative to Apache Avro for serialization and deserialization by leveraging runtime code generation, which delivers significantly better performance than the native implementation. Today, the library is widely adopted within LinkedIn thanks to its ability to support multiple Avro versions at runtime and its superior performance.
The Venice platform has fully integrated Fast-Avro, and in one major use case we observed up to a 90% improvement in p99 deserialization latency on the application side, making it an essential technology within the Venice ecosystem.
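As a rough usage sketch, Fast-Avro is designed as a drop-in replacement for Avro’s DatumReader/DatumWriter interfaces. The example below assumes the FastGenericDatumReader entry point from the linkedin/avro-util fast-avro module and uses a toy schema:

```java
import com.linkedin.avro.fastserde.FastGenericDatumReader;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;

import java.io.ByteArrayOutputStream;
import java.io.IOException;

public class FastAvroSketch {
  private static final Schema SCHEMA = new Schema.Parser().parse(
      "{\"type\":\"record\",\"name\":\"Member\",\"fields\":"
          + "[{\"name\":\"id\",\"type\":\"long\"},{\"name\":\"name\",\"type\":\"string\"}]}");

  public static void main(String[] args) throws IOException {
    // Serialize a record with vanilla Avro.
    GenericRecord record = new GenericData.Record(SCHEMA);
    record.put("id", 42L);
    record.put("name", "venice");
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
    new GenericDatumWriter<GenericRecord>(SCHEMA).write(record, encoder);
    encoder.flush();

    // Deserialize with Fast-Avro: same DatumReader contract as Apache Avro, but the
    // decoder logic is generated and compiled at runtime for this specific schema.
    FastGenericDatumReader<GenericRecord> fastReader = new FastGenericDatumReader<>(SCHEMA);
    GenericRecord decoded =
        fastReader.read(null, DecoderFactory.get().binaryDecoder(out.toByteArray(), null));
    System.out.println(decoded);
  }
}
```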
2. Parallel Processing
As highlighted in the diagram above, most operations involved in handling DCR and partial updates are CPU-intensive. In the traditional pipeline, these operations were executed sequentially, record by record, within the same partition, leading to significant CPU underutilization. To address this, we introduced parallel processing to handle multiple records concurrently within the same partition before producing them to the version topic—while still preserving strict ordering in the final step.
This approach has significantly improved the write throughput for these complex records.
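The sketch below illustrates the general pattern rather than Venice’s actual code: per-record work is fanned out to a worker pool, while results are drained in arrival order so that messages reach the version topic strictly ordered.

```java
import java.util.ArrayDeque;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class OrderedParallelProcessingSketch {
  public static void main(String[] args) {
    ExecutorService workers = Executors.newFixedThreadPool(8);
    List<String> batch = List.of("update-1", "update-2", "update-3");

    // Fan out the CPU-heavy work (decode, conflict resolution, merge, re-encode) per record.
    Queue<CompletableFuture<String>> inFlight = new ArrayDeque<>();
    for (String record : batch) {
      inFlight.add(CompletableFuture.supplyAsync(() -> expensiveMerge(record), workers));
    }

    // Drain in the original order, so messages are produced to the version topic strictly ordered.
    while (!inFlight.isEmpty()) {
      String merged = inFlight.poll().join();
      produceToVersionTopic(merged);
    }
    workers.shutdown();
  }

  private static String expensiveMerge(String record) {
    return record + "-merged"; // Placeholder for decode + conflict resolution + re-encode.
  }

  private static void produceToVersionTopic(String payload) {
    System.out.println("produced: " + payload); // Placeholder for the PubSub producer call.
  }
}
```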
Use Case 4: active/active replication with deterministic write latency
At the end of the day, users want to minimize write latency as much as possible, even when operating under an eventual consistency model. Providing more deterministic write latency has therefore become increasingly critical.
Venice is a versioned platform that can ingest backup, current, and future versions concurrently within a single Venice server instance. However, in practice, only the current version serves read traffic, and deterministic write latency guarantees are primarily focused on the current version.
To meet these expectations, Venice has implemented a pooling strategy in the ingestion pipeline that assigns different priorities to various workload types. The consumer is the first stage of the ingestion pipeline in the Venice server, so controlling the polling rate through separate pools effectively achieves the desired ingestion prioritization.
We group workloads into a few broad priority tiers based on how demanding they are and how critical their data is. At the top are active/active and partial update workloads for the current version on the leader replica—these tend to be both CPU-intensive and latency-sensitive. Right behind them are other workload types that also target the current version. After that come active/active or partial update workloads for backup or future versions on the leader replica, and finally, everything else falls into a lower-priority bucket.
This setup helps us strike a balance between performance and efficiency. By isolating the most CPU-heavy workloads, we can prevent them from slowing down lighter ones. Prioritizing ingestion for the current version keeps the most up-to-date data flowing smoothly. And by keeping the number of resource pools limited, we avoid overcomplicating resource management or wasting CPU cycles on underutilized pools.
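A schematic sketch of this tiering is shown below; the pool names and selection rules are illustrative, not Venice’s concrete configuration.

```java
public final class PoolSelectionSketch {
  enum Pool {
    CURRENT_AA_WC_LEADER,     // Active/active or partial-update, current version, leader replica.
    CURRENT_OTHER,            // Everything else targeting the current version.
    NON_CURRENT_AA_WC_LEADER, // Active/active or partial-update on backup/future versions, leader.
    NON_CURRENT_OTHER         // Lowest priority: all remaining workloads.
  }

  /** Map a replica's workload characteristics to a consumer pool (illustrative rules). */
  static Pool selectPool(boolean isCurrentVersion, boolean isLeader, boolean isAaOrPartialUpdate) {
    if (isCurrentVersion) {
      return (isLeader && isAaOrPartialUpdate) ? Pool.CURRENT_AA_WC_LEADER : Pool.CURRENT_OTHER;
    }
    return (isLeader && isAaOrPartialUpdate)
        ? Pool.NON_CURRENT_AA_WC_LEADER
        : Pool.NON_CURRENT_OTHER;
  }
}
```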
That said, tuning these pools isn’t straightforward. Different clusters see very different kinds of workloads, and even within the same cluster, stores can vary a lot in behavior. Throughput can swing widely over time, and system load tends to shift throughout the day as read traffic ebbs and flows. All of this makes finding the right balance between pools a constant, evolving challenge.
Relying on static configurations forces tuning for the worst-case scenario, which often leads to resource underutilization during normal operations. To address this, we introduced adaptive throttling — a dynamic approach that adjusts ingestion behavior based on recent system performance.
When the system operates within agreed SLAs, ingestion rates are adjusted according to predefined priorities. If an SLA is violated, ingestion rates are immediately throttled back.
Defining clear SLA metrics is crucial for reliable production performance. Currently, Venice focuses on two key criteria:
- Read Latency SLA: This is the highest priority. We never want to violate read latency SLAs, even if it means sacrificing ingestion throughput.
- Write Latency SLA for the Current Version: As long as read latency SLAs are met, write latency for the current version becomes the top priority, and Venice proportionally tunes the pools to maximize resource utilization and throughput.
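Putting the two criteria together, a simplified sketch of the adaptive throttling loop might look like the following, where the thresholds, adjustment factors, and names are illustrative assumptions:

```java
public final class AdaptiveThrottlerSketch {
  private static final double READ_LATENCY_SLA_MS = 2.0;        // Illustrative threshold.
  private static final double WRITE_LATENCY_SLA_SECONDS = 600;  // e.g. a 10-minute write SLA.

  private double currentVersionPollRate = 1_000; // Records/sec granted to current-version pools.

  /** Called periodically with recently observed metrics. */
  void adjust(double readLatencyP99Ms, double currentVersionWriteLatencySeconds) {
    if (readLatencyP99Ms > READ_LATENCY_SLA_MS) {
      // Read SLA is the top priority: back off ingestion immediately.
      currentVersionPollRate *= 0.5;
    } else if (currentVersionWriteLatencySeconds > WRITE_LATENCY_SLA_SECONDS) {
      // Reads are healthy but writes are lagging: grant more throughput to current-version pools.
      currentVersionPollRate *= 1.2;
    } else {
      // Both SLAs are met: probe upward gently instead of leaving resources idle.
      currentVersionPollRate *= 1.05;
    }
  }
}
```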
Venice scale at LinkedIn
Thanks to these optimizations, the Venice platform at LinkedIn now handles over 175 million key lookups per second and 230 million writes per second, all while maintaining a write latency SLA of under 10 minutes.
Learning and future plans
In addition to the major optimizations discussed in this blog, the Venice team continuously applies general tuning improvements across the entire ecosystem. These include upgrading the JDK version, implementing object reuse and pooling, enhancing code efficiency, and more.
One ongoing effort in Venice is building a replica-level checkpointing mechanism: peer-to-peer blob transfer. When a new Venice server host joins the cluster, it will not need to reconsume all the messages from the PubSub broker to rebuild its assigned replicas; instead, it will be able to ask peers in the cluster to transfer the RocksDB databases directly. With this feature, the Venice server can avoid a large amount of unnecessary consumption and processing time and restore its serving state in a more efficient and predictable way.
Performance optimization is an ongoing, iterative process that enables us to balance trade-offs effectively and achieve the desired outcomes. It is never truly complete: despite the strategies already in place, there is still much work to be done. Our goal is to make tuning more self-adaptive to varying workload types, further enhancing operational efficiency.
Acknowledgements
Brainstorming and verifying these performance tuning strategies has been a team effort. We would like to thank our colleagues Felix GV, Hao Xu, Lei Lu, Manoj Nagarajan, Min Huang, Nisarg Thakkar, Sourav Maji, Sushant Mane, Xun Yin, and Zac Policzer for their invaluable contributions and feedback. Thanks also to the leadership team for supporting our efforts: Manu Jose, Song Lu, and Laura Chen.
Finally, thanks to our incredible SREs who keep the site up and running: Ali Poursamadi, Gregory Walton, Michael Askndafi, Namitha Vijayakumar, Ran Wang and Dan White.