Building the incremental and online training platform at LinkedIn
The engineering and product teams at LinkedIn are dedicated to making sure that every interaction a member or customer has with our platform feels responsive and valuable. Incremental and online training systems help us achieve this by taking the professional interactions happening across LinkedIn and improving the personalized recommendation models that lead to more relevant posts, ads, jobs and more being surfaced.
In the past year, we’ve made significant strides in developing and deploying incremental and online training platforms to support our major AI use cases, including Feed, Ads, and “Jobs you might be interested in.” The result has been significant metric gains in core areas that we monitor, such as:
- A >2% increase in Total Professional Interactions for Feed, reflecting member reactions to daily feed posts
- A >2% increase in Total Qualified Applications, reflecting the quality of job recommendations
- A >4% increase in Click-Through-Rate for Ads, reflecting more relevant ads
In this post, you will learn how we designed and scaled these systems in production, the key challenges and trade-offs we faced, and the engineering best practices that emerged. We’ll also share insights into areas where we see future potential for even faster and more adaptive recommendation systems.
The power of incremental and online training
Incremental and online training enable rapid model updates using only the most recent data and the latest trained model. Compared with cold-start (bulk) training, training shifts from ad-hoc or one-time runs to recurrent executions, where each output model builds on the model produced by previous runs.
The benefits of incremental and online training include but are not limited to:
- Cost savings: Achieved through optimized resource usage, such as not retraining on redundant datasets from previous days. This approach can yield approximately X-fold resource savings, where X is the time window length for bulk training divided by the time window length per incremental iteration (see the short worked example after this list). Our benchmark results demonstrate the potential for an 8.9x cost reduction compared to conventional training for our Ads pipeline.
- Model freshness: The ability to move model training and updates from a daily cadence to an hourly or even minute-level cadence.
- Large-dataset training with fault tolerance: When training on large datasets, incremental training avoids having to restart the entire job if a run fails in an intermediate state.
- Feature consistency: Using nearline-generated training data guarantees that features are consistent between inference time and training time.
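As a rough illustration of the X-fold estimate above, here is a minimal sketch with made-up window lengths (a 7-day bulk window versus daily incremental runs), not our production settings:

```python
# Minimal sketch of the X-fold savings estimate; the window lengths below are
# illustrative assumptions, not LinkedIn's actual training configuration.
bulk_window_hours = 7 * 24        # bulk training re-reads a 7-day window each run
incremental_window_hours = 24     # each incremental run only trains on the newest day

savings_factor = bulk_window_hours / incremental_window_hours
print(f"approximate resource savings: {savings_factor:.1f}x")  # -> 7.0x
```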
While these benefits highlight why incremental and online training are attractive, realizing them in practice requires more than just training logic changes. The transition from traditional batch workflows to low-latency, continuously running pipelines introduces new operational, data, and infrastructure complexities. In other words, achieving cost savings, fresher models, fault-tolerant large-scale training, and consistent features at inference time is not automatic. Below we outline the key engineering challenges and how we made these benefits a reality.
Engineering challenges
Although the idea of incremental and online training is simple, the major challenge lies in the production stage, where we aim to automate the end-to-end training experience. Below are the core challenges we faced during our development of these systems.
Prerequisite: Nearline training data generation
Feature inconsistency between inference and training is a common issue with offline-prepared data: by the time member activities are logged as training data, the features may have already been updated, so the values seen at training time differ from those used at inference. This made it necessary to refactor the training data generation pipeline to log the features used during inference, and to use our nearline data processing engine (Flink) to scale out training data generation.
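To make the idea concrete, here is a hedged sketch of what logging inference-time features might look like; the topic name, field names, and the kafka-python client are illustrative assumptions, not the actual pipeline:

```python
import json
from kafka import KafkaProducer  # kafka-python; any Kafka client would do

# Hypothetical sketch: emit the exact feature values used to score a request,
# so downstream training data carries the same features the model saw at inference.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def log_served_features(member_id, item_id, features, score):
    # Topic and schema are placeholders for illustration only.
    producer.send("feature-tracking-events", {
        "memberId": member_id,
        "itemId": item_id,
        "features": features,  # the feature vector actually used for this score
        "score": score,
    })
```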
Streaming data ingestion
There are no standard out-of-the-box libraries to load streaming data directly as tensor representations in ML frameworks (e.g., PyTorch, TensorFlow). An online training platform therefore has to manage how tensor representations are serialized and deserialized from Kafka messages, and scale up reading throughput so training can keep up with the latest events.
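Below is a minimal sketch of the deserialization step, assuming a JSON payload with `features` and `label` fields and the kafka-python consumer; the in-house library additionally handles schema metadata, batching, and throughput scaling:

```python
import json
import torch
from kafka import KafkaConsumer  # kafka-python

# Illustrative only: parse each Kafka message into tensors for the training loop.
consumer = KafkaConsumer("training-events", bootstrap_servers="localhost:9092")

def message_to_tensors(raw_value: bytes):
    record = json.loads(raw_value)
    features = torch.tensor(record["features"], dtype=torch.float32)
    label = torch.tensor(record["label"], dtype=torch.float32)
    return features, label

for message in consumer:
    x, y = message_to_tensors(message.value)
    # hand (x, y) off to the batching layer / training loop
```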
Fragile runtime environment
In a regular bulk training scenario, AI engineers usually run training jobs based on custom training loops, with their own Python packages for model development. This approach works well when users need full flexibility for training. However, incremental and online training require a scheduled or continuously running workflow, so that infrastructure engineers can quickly identify and resolve failures within a limited response time in production. This means the incremental training platform needs to isolate user and infrastructure code, while also standardizing training loops with centralized dependency management and monitoring systems.
State management
In incremental and online training scenarios, the system needs not only to load and save model checkpoints, but also to track the metrics and data partitions that have or have not been used. For example, if a corrupted checkpoint is generated for any reason, the system needs to recover from the previous state and catch up on the unread data.
Monitoring and alerting
With frequent model updates and many interconnected system components, monitoring is both critical and complex. Incremental training systems must guard against gradual drift and imbalanced label distributions, which can silently degrade model quality over time. Maintaining continuous visibility into golden signals – such as loss, AUC, and accuracy – as well as key business metrics is essential for catching subtle regressions early. This is made more challenging by the dynamic nature of streaming data, the distributed nature of training workloads, and evolving data schemas, all of which can introduce silent failures or inconsistencies.
Post-training calibration
Model calibration is the process of adjusting model scores to better reflect real-world probabilities. For example, in a well-calibrated model, a predicted click-through rate (pCTR) of 0.7 should correspond to an empirical 70% click rate. This is especially critical in applications like Ads auctions and bidding, where decisions rely not just on relative ranking but on the absolute accuracy of probabilities.
However, calibration presents unique challenges in an online training context. Traditional methods like Isotonic Regression are batch-based and computationally intensive, taking hours to run, which diminishes the real-time benefits of online learning. This creates the need for a solution that avoids running synchronous offline calibration after training finishes.
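For reference, the traditional batch approach mentioned above looks roughly like the following toy sketch using scikit-learn's IsotonicRegression; the data is fabricated for illustration and this is not our production calibration code:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

# Toy batch calibration: fit a monotonic mapping from raw scores to observed labels.
raw_scores = np.array([0.2, 0.35, 0.4, 0.6, 0.7, 0.9])   # model outputs (pCTR)
clicked    = np.array([0,   0,    1,   0,   1,   1])      # observed click labels

calibrator = IsotonicRegression(out_of_bounds="clip")
calibrator.fit(raw_scores, clicked)

calibrated_pctr = calibrator.predict(np.array([0.7]))     # calibrated probability
```

At production scale this fit runs over far larger datasets, which is part of what makes the batch approach hours-long.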
Architecture overview
As a foundational element of a unified AI training platform, we designed our incremental and online training system with three key principles in mind:
- Extensibility: Support for multiple ML frameworks within LinkedIn (e.g. TensorFlow, PyTorch) and use cases (offline and online training) to ensure future-readiness and extensibility.
- Observability: Seamless access to metrics, performance insights, and error diagnostics via integrated dashboards and debugging tools.
- Sustainability: Software that is robust against failure, easy to alert on, and able to recover from a previous state after a system failure.
Grounded in these principles, the platform’s architecture brings together multiple LinkedIn services to deliver an end-to-end incremental and online training platform:
- Kubernetes: For job scheduling and service management, including queue management.
- OpenConnect: Our next-generation AI pipeline platform, in charge of workflow management. The system is built on top of Flyte, an open-source workflow orchestration tool.
- Kafka (XInfra): An open-source-based message storage and delivery system for streaming training data. It supports message offset management and event delivery with an at-least-once guarantee.
- Flink: An open-source-based system to support large scale cross-cluster stream data processing.
- Spark: An open-source-based system for data Extraction, Transformation, and Loading (ETL).
- OpenHouse: An in-house system with offline data access API to get the offline training data.
- HDFS: An open-source-based storage layer for offline training data.
- MLflow: An open-source-based metrics tracking platform that supports real-time metrics reporting and dashboards.
- Ambry: The backbone of blob storage for the model checkpoints and version management.
- Model Cloud: An in-house model serving platform.
- Airflow: An open-source-based orchestration platform to manage automated data pipelines.
These components form the foundation of LinkedIn’s end-to-end machine learning training and serving ecosystem. Each service contributes a specific capability—whether it’s real-time data streaming, workflow orchestration, large-scale storage, or model deployment—and their tight integration enables us to move data efficiently from raw events to production-ready models.
Building on this infrastructure, our training platform executes a series of coordinated steps to prepare data, train models, manage artifacts, and serve predictions at scale. The overall workflow can be broken down into the following steps, which we describe in further detail in the next section:
- Users prepare the training data for model training (through Spark or Flink)
- Users export a serialized training model with a static compute graph
- Users launch the training component and provide the offline dataset (from OpenHouse) or streaming dataset (from Kafka and nearline training data generation)
- As the training component starts, we scale out data reading across multiple data ingestion workers (or, optionally, load a transformation model bundle and run last-mile transformation)
- We execute distributed training via the user-provided training compute graph
- We save checkpoints periodically, together with stateful data loading information
- We wrap the trained model for serving and publish it to Model Cloud
Core components
Nearline training data generation
To enable online model training and online-offline feature consistency, we built the nearline feature attribution generator—a fully in-house streaming application developed using Apache Samza, Apache Beam, and Couchbase. Its primary role is to generate fully attributed, enriched event streams by joining a wide range of behavioral and system signals. The application ingests several Kafka streams, including front-end user engagement events (views, clicks, applies), back-end signals (channel decisions, model scores), and metadata such as ranking features. While each stream offers only partial insight, combining them yields a comprehensive, event-level view that forms the foundation for online feature stores, model training, and experimentation. The system currently handles 100% of feature tracking events (which includes feature values used during ranking) across all models, maintaining 99.99% availability.
The application follows the Beam programming model, where each event flows through a well-defined pipeline: Read, Filter, Explode, Transform, Join, Transform, and Write. The join stage supports a 1-hour join window and performs a stateful many-to-many join across the input streams. To keep memory usage efficient, large payloads are offloaded to Couchbase early and rehydrated just in time, reducing Beam state size by over 60%. The system consistently processes 30k–35k events per second, with an end-to-end latency under 5 ms and a join error rate of less than 1%. The enriched events are written back to Kafka and serve as real-time input to downstream ML and analytics systems.
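The sketch below shows the Read, Filter, Explode, Transform, Join, Transform, Write shape of that pipeline in the Beam Python SDK, using in-memory placeholders. Field names ("requestId", "isTracked", "items") and the sources are assumptions; the production pipeline reads from Kafka, runs on Samza, and offloads large payloads to Couchbase.

```python
import apache_beam as beam

# Placeholder inputs standing in for the Kafka streams described above.
engagement_events = [
    {"isTracked": True, "items": [{"requestId": "r1", "action": "click"}]},
]
backend_signals = [
    {"requestId": "r1", "modelScore": 0.42, "rankingFeatures": [0.1, 0.3]},
]

with beam.Pipeline() as p:
    engagements = (
        p | "ReadEngagements" >> beam.Create(engagement_events)       # stand-in for a Kafka read
          | "FilterTracked"   >> beam.Filter(lambda e: e["isTracked"])
          | "ExplodeItems"    >> beam.FlatMap(lambda e: e["items"])   # one record per ranked item
          | "KeyEngagements"  >> beam.Map(lambda e: (e["requestId"], e))
    )
    signals = (
        p | "ReadSignals" >> beam.Create(backend_signals)
          | "KeySignals"  >> beam.Map(lambda s: (s["requestId"], s))
    )
    enriched = (
        {"engagements": engagements, "signals": signals}
        | "JoinByRequest" >> beam.CoGroupByKey()                      # stateful join in production
        | "Enrich" >> beam.Map(lambda kv: {"requestId": kv[0], **kv[1]})
    )
    enriched | "Write" >> beam.Map(print)                             # stand-in for writing to Kafka
```

The dict-of-PCollections CoGroupByKey stands in for the stateful many-to-many join; in production the join state is kept small by offloading large payloads to Couchbase and rehydrating them just in time.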
The second step in the pipeline is the Flink-based data transformation service, which consumes the attributed events and prepares them for model training. It converts them into the format required by our online training infrastructure and performs negative downsampling (typically at a 100:1 ratio) to ensure label balance and reduce bias in learning.
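A minimal sketch of the 100:1 negative downsampling step (the real logic lives in the Flink job; the keep rate here is simply the ratio quoted above):

```python
import random

NEGATIVE_KEEP_RATE = 1 / 100  # the ~100:1 ratio mentioned above

def keep_example(example: dict) -> bool:
    """Keep every positive example; keep roughly 1 in 100 negatives."""
    if example["label"] == 1:
        return True
    return random.random() < NEGATIVE_KEEP_RATE
```

Note that downsampling shifts the observed label distribution, which is one reason calibrating predicted probabilities matters downstream.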
This two-step architecture—decoupling attribution from transformation—allows teams to evolve independently while maintaining high throughput, low latency, and statistically sound training data. Together, they power continuous online learning, live experimentation, and personalized recommendations at scale without relying on batch workflows.
Streaming data ingestion
After the streaming data is prepared, we ingest it into our model training framework. To achieve performance comparable to our pre-existing offline data loaders, we created a Ray-based streaming data ingestion library that scales out reading throughput from ~10 records per second to ~12,000 messages per second. The library handles data parsing, feature metadata extraction, and tensor conversion for the corresponding ML frameworks (e.g., PyTorch, TensorFlow).
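The following is a simplified sketch of the scale-out pattern only (one Ray actor per consumer, with Kafka's consumer group spreading partitions across actors); the actual library layers schema handling, batching, backpressure, and offset checkpointing on top, and all names here are illustrative:

```python
import json
import ray
import torch
from kafka import KafkaConsumer  # kafka-python

@ray.remote
class StreamReader:
    """One reader actor; the shared consumer group assigns Kafka partitions across actors."""
    def __init__(self, topic: str):
        self.consumer = KafkaConsumer(
            topic,
            bootstrap_servers="localhost:9092",
            group_id="online-training-ingestion",
        )

    def next_batch(self, batch_size: int = 256):
        batch = []
        for message in self.consumer:
            record = json.loads(message.value)
            batch.append((torch.tensor(record["features"]), torch.tensor(record["label"])))
            if len(batch) >= batch_size:
                break
        return batch

ray.init()
readers = [StreamReader.remote("training-events") for _ in range(8)]  # scale out readers
batches = ray.get([reader.next_batch.remote() for reader in readers])
```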
Static training graph execution
After the training dataset is converted to the desired format, a continuous and recurrent training service updates the model parameters. Specifically, to resolve the “fragile runtime environment” challenge mentioned in the challenges section, we take the approach of accepting a static training graph as input instead of user code. A “training graph” is a serialized computation graph, backed by tf.function for TensorFlow and torch.export for PyTorch. Unlike a “serving graph,” which only includes the forward pass used for model serving, a training graph contains all of the computation: forward pass, loss computation, backward pass, and optimizer state update.
The training graph is then used for the next incremental training run, or continuously for online training: we can directly load and execute the user's training graph to update model parameters.
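As a simplified PyTorch illustration of the idea, the sketch below captures only the forward pass plus loss; the production training graph described above also carries the backward pass and optimizer update, and the model here is a stand-in:

```python
import torch
from torch.export import export, save, load

class TrainingStep(torch.nn.Module):
    """Toy module whose forward computes the loss for a batch."""
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Linear(32, 1)
        self.loss_fn = torch.nn.BCEWithLogitsLoss()

    def forward(self, features, labels):
        logits = self.model(features).squeeze(-1)
        return self.loss_fn(logits, labels)

# User side: serialize the computation graph once.
example_inputs = (torch.randn(4, 32), torch.rand(4))
exported = export(TrainingStep(), example_inputs)
save(exported, "training_step.pt2")

# Platform side: load and execute in the centralized runtime image,
# with no user Python dependencies required.
step = load("training_step.pt2").module()
loss = step(torch.randn(4, 32), torch.rand(4))
```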
Compared to users manually loading models and writing their incremental and online training loops, a training graph provides the following benefits:
- Export once, run everywhere: A user's custom dependencies can introduce unintended overhead and can be challenging to manage. With the static training graph, we no longer need the user's custom Python dependencies to build the model and bootstrap training. The serialized graph can be loaded in our centralized runtime image where TensorFlow or PyTorch is installed, and it can also be loaded in a C++ serving environment (e.g., TF Serving, Torch AOTInductor).
- Fault tolerance: For a deployed incremental training workflow, we don't allow users to modify the model structure or training loop on the fly; such changes are hard to trace back and might cause production issues.
- Isolation: In legacy solutions, we relied on users inheriting the model base class to run training in a custom environment. The default implementation was tangled with the user's implementation, which made it very hard to debug and root-cause issues. That might be acceptable in a cold-start setting, but in incremental and online training, troubleshooting is very time-sensitive, and we don't want to add operational burden.
- Simplified onboarding: Users no longer need to write custom training loops. They only provide configuration inputs (e.g. dataset path, batch size, epochs, etc.), and the system handles execution and model updates automatically.
The training service takes data from the data ingestion service and executes training graphs via the corresponding ML framework (e.g., the TF or PyTorch runtime). We also handle distributed training across different distributed training frameworks (e.g., Horovod for TensorFlow; Ray Train with DDP/FSDP/HSDP for PyTorch), as sketched below.
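For the Ray Train + DDP path, a minimal hand-written sketch might look like the following; the production service executes the exported training graph rather than this toy loop, and the data here is random:

```python
import torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer, prepare_model

def train_loop_per_worker(config):
    model = prepare_model(torch.nn.Linear(32, 1))     # wraps the model in DDP
    optimizer = torch.optim.Adam(model.parameters(), lr=config["lr"])
    loss_fn = torch.nn.BCEWithLogitsLoss()
    for _ in range(config["steps"]):
        features, labels = torch.randn(256, 32), torch.rand(256)  # stand-in for ingested batches
        loss = loss_fn(model(features).squeeze(-1), labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

trainer = TorchTrainer(
    train_loop_per_worker,
    train_loop_config={"lr": 1e-3, "steps": 100},
    scaling_config=ScalingConfig(num_workers=4, use_gpu=False),
)
trainer.fit()
```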
State management
Incremental and online training is a stateful service, which means we need a mechanism to save and recover state across multiple executions and to handle system disruptions. The saved state includes the following information (a minimal sketch of such a state record follows this list):
- Model checkpoint: A TF SavedModel or exported PyTorch model, including model parameters and optimizer states.
- Training metadata: This covers the following items:
- Data loading status: For an offline dataset, we keep track of the offline date partitions we have loaded, and the data loader only loads the latest data that has not been trained on before. For a streaming dataset, we keep track of the timestamp and partition offsets from Kafka.
- Model lineage: Information about the upstream parent model or last checkpoint the model was trained on.
- Model metrics: The model metrics from real-time validation or offline bulk validation.
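A minimal sketch of such a state record (field names are illustrative, not the platform's actual schema):

```python
from dataclasses import dataclass, field

@dataclass
class TrainingState:
    checkpoint_path: str                 # saved model parameters + optimizer states
    parent_model: str                    # lineage: the model/checkpoint this run started from
    trained_partitions: list = field(default_factory=list)  # e.g. ["2024-06-01", "2024-06-02"]
    kafka_offsets: dict = field(default_factory=dict)       # {partition: last trained offset}
    metrics: dict = field(default_factory=dict)             # e.g. {"auc": 0.81, "loss": 0.43}
```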
Monitoring and alerting
We built a high-level metrics library that standardizes metric emission across all components in the training stack. The library attaches rich default dimensions (e.g. fabric, Flyte/Kubernetes metadata) to ensure consistent tagging and attribution. It supports multi-process environments – such as distributed training with Horovod or Ray – by aggregating metrics via Prometheus’ multiprocess mode. This allows all metrics to be collected and served through a centralized endpoint, ultimately feeding into LinkedIn’s observability platform. The abstraction streamlines integration, prevents metric fragmentation, and ensures critical signals (e.g. training throughput, data volume, latency, and freshness) are captured reliably.
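The sketch below shows the underlying Prometheus multiprocess pattern, not the in-house library itself; the metric names, labels, directory, and port are illustrative:

```python
import os

# The environment variable must be set before prometheus_client is imported
# so that metrics are written to per-process files that can be aggregated.
os.environ.setdefault("PROMETHEUS_MULTIPROC_DIR", "/tmp/prom_metrics")
os.makedirs(os.environ["PROMETHEUS_MULTIPROC_DIR"], exist_ok=True)

from prometheus_client import CollectorRegistry, Counter, multiprocess, start_http_server

# Emitted from each training / data ingestion worker process.
records_trained = Counter(
    "training_records_total", "Records consumed by training", ["fabric", "pipeline"]
)
records_trained.labels(fabric="prod", pipeline="ads-ctr").inc(256)

# In a single exporter process: aggregate all workers behind one scrape endpoint.
registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)
start_http_server(9090, registry=registry)
```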
The emitted metrics are sent to our centralized metrics dashboard for monitoring and alerting.
Model calibration
To address the model calibration challenge, we built a calibration pipeline that allows online-trained models to reuse periodically updated calibration mappings. Optimizations like parallelized calibration were also introduced to significantly reduce calibration training runtime.
Furthermore, we improved calibration quality by adopting Nearline Feature Attribution as the calibration data source. Unlike previous approaches that relied on stratified sampling to approximate online traffic distribution, this data source directly captures the actual serving distribution. This led to measurable gains in daily calibration performance, with O/E ratios (observed-to-expected) at top-ranked positions improving from 0.65±0.1 to 1.0±0.1.
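Conceptually, reusing a periodically updated calibration mapping can be as simple as interpolating over precomputed score buckets, as in this hypothetical sketch (bucket values are made up):

```python
import numpy as np

# Buckets produced by the periodic calibration job (values are illustrative).
bucket_raw_scores = np.array([0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0])
bucket_calibrated = np.array([0.0, 0.05, 0.2, 0.45, 0.7, 0.92, 1.0])

def calibrate(raw_score: float) -> float:
    """Map a raw model score through the latest calibration mapping."""
    return float(np.interp(raw_score, bucket_raw_scores, bucket_calibrated))

calibrate(0.7)  # ~0.7 for a well-calibrated bucket (O/E close to 1.0)
```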
Component CI/CD
To manage component releases with more automation and safety, we leveraged GitHub Actions and Flyte launch plans, together with an integration test workflow, for our CI/CD processes. The workflow is as follows:
- For each new commit to the library, a set of core integration tests is executed on a temporarily refreshed runtime image, built in the post-merge step with the latest code changes. This process helps identify and prevent component-level code regressions.
- The new version is automatically deployed and published to production Flyte environments as a new version of the Flyte launch plan. By default, the code project version remains in the release candidate state until our developer team designates it as active.
Accomplishments
As mentioned earlier, our incremental and online training platform has been adopted by several major AI vertical use cases. Some of those use cases include:
- “Jobs you might be interested in” (JYMBII) L2 model: We have onboarded the JYMBII L2 model to a combined incremental and online training pipeline. The results have been a >2% increase in Total Qualified Applications over our baseline model, with a >2% decrease in dismisses.
- Ads incremental training: We have onboarded Ads offline CTR model training to incremental training and have seen a >4% increase in Click-Through-Rate.
- Feed Second Pass Ranking (SPR) incremental training: We have onboarded the Feed SPR model, with a direct impact of >0.05% Daily Active Users, >0.3% Onsite Revenue (Site-Wide Impact), >2% Total Professional Interactions, and >0.1% Total Macrosessions.
Future directions
Looking ahead, we want to expand our capabilities and further enrich our platform to make it a standard production-level training solution:
- Support for generative recommendation models and LLM-based recommendations: Our platform currently supports point-wise training based on individual user activity events. However, with the evolution of generative models, we see potential for supporting incremental training of sequential models.
- Support for complex training schemes: The data -> training -> serving scheme applies well to regular supervised learning. With the evolution of large language models, post-training schemes like knowledge distillation and reinforcement learning have come into play. We will look to extend our system to include different types of services coordinated with each other (e.g., teacher and student model training) to support more online training solutions.
- Auto-optimization for training performance: We are exploring compute engine optimizations like graph fusion, hardware planning, and model-specific template matching for continuous performance enhancement, so that users are freed from system-level configuration.
Conclusion
Incremental training is more than a technical solution—it is a transformative approach to evolving AI models. By refining core principles, addressing challenges, and driving innovation, we have strengthened LinkedIn’s training platform to support scalable, data-driven AI experiments. Looking ahead, our focus remains on equipping teams with the tools to unlock the full potential of their models and on making incremental training a standard practice across all of LinkedIn’s AI solutions.
Acknowledgement
It has been an incredible journey to build this end-to-end solution from 0 to 1, thanks to the core contributors: Chen Zhu, Daniel Tang, Ethan Lyu, Haoyue Tang, Hui Wang, Lijuan Zhang, Tao Huang, Valentine Lin, Yang Pei, Yujie Ai, and Yun Zhang from the Training and Framework Team.
The project would not have been successful without the help of our sibling teams, including Declan Dboyd, Ratnadeep Mangalvedhekar, Vitaly Abdrashitov, Zachary Escalante, Jianqiang Shen, and Xiaoqing Wang from Talent Marketplace; Ruoyan Wang, Oz Ozkan, and Zhiwei Wang (alumni) from Ads AI; Yunbo Ouyang (alumni) and Lei Le from Core AI; Hailing Cheng from Feed AI; Lingyu Lyu, Youmin Han, Biao He (alumni), Yubo Wang, Wenye Zhang, Santosh Jha, Ankit Goyal, Frank Gu, and Chen Xie from the OpenConnect team; Yj Yang, Ai Shi, Doug Walther, Arthur Fox, and Siddharth Sodhani from MLOps; Sophie Guo from the Storage team; Tugrul Bingol, Deepesh Nathani, Minglei Zhu, and Vivek Hariharan from the Model Serving Team; and Yuchen He and Supriya Madhuram from the Metrics Platform Team.
Moreover, special thanks for the great support from our leadership: Yitong Zhou, Animesh Singh, Kapil Surlaker, and Raghu Hiremagalur.
Last but not least, thanks to Maneesh Varshney and Benito Leyva for their blog reviews and suggestions.