The Edge of Innovation: Engineering Insights from an Evolving Edge-Building System at LinkedIn
LinkedIn’s professional network is powered by graph technology, with numerous edges representing relationships and activities across the platform, such as member connections and follows. One crucial element in supporting member network growth is our edge-building system, which recommends relevant and valuable edges to help members build and nurture professional relationships and engage with content aligned with their interests and goals. The system includes a range of AI-powered recommender services, such as:
- People You May Know (PYMK): Helps members discover new connections and expand their network.
- Follows: Suggests relevant influencers, pages, events, and other entities for members to follow and engage with content of interest.
- Catch Up: Supports timely and meaningful interactions to maintain and strengthen existing relationships.
Over the years, we’ve significantly evolved the infrastructure that supports our edge-building system with improvements that deliver more personalized recommendations. Optimizing how edges are recommended enhances both the quality of results and the overall experience for members.
One significant evolution is how model inference is hosted and executed. Model inference is the process of using a trained machine learning model to make predictions or recommendations based on input data such as member activities, profiles, or connections. The design of these inference workflows is influenced by various factors, including the need for real-time feature freshness, system performance requirements, scalability, and cost considerations.
In this blog, we’ll walk through the evolution of inference workflows within our edge-building system - covering offline, nearline, online, and remote approaches - and discuss the trade-offs and factors that have influenced the system’s design.
Inference workflows and edge-building evolution
Despite variations in hosting and execution, our model inference workflows follow a common process:
- Raw data is collected from multiple sources (e.g., member interactions and profiles), and then pre-processed and transformed into input features suitable for the model, often through background pipelines.
- The trained machine learning model applies learned patterns to input features to generate predictions or recommendation scores.
- The results are stored or directly served to the front-end services, depending on whether the inference is conducted in offline, nearline, or online environments. While the model output is consistent for the same input, the freshness of the computed results can vary across these environments, affecting what is delivered to end users.
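To make this flow concrete, below is a minimal Python sketch of the three steps under simplified assumptions; the class and function names (Candidate, RankingModel, build_features, score_candidates) and the specific features are illustrative stand-ins rather than our actual APIs.

```python
# Minimal sketch of the common inference flow: raw data -> features -> model
# scores -> ranked results. All names and features here are illustrative.
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Candidate:
    entity_id: str
    features: Dict[str, float]


class RankingModel:
    """Stand-in for a trained model that returns a relevance score per candidate."""

    def predict(self, features: Dict[str, float]) -> float:
        # A real model applies learned parameters; this placeholder just sums features.
        return sum(features.values())


def build_features(raw_event: dict) -> Dict[str, float]:
    """Step 1: pre-process raw member data into model-ready features."""
    return {
        "connection_count": float(raw_event.get("connections", 0)),
        "profile_views_7d": float(raw_event.get("profile_views_7d", 0)),
    }


def score_candidates(model: RankingModel, candidates: List[Candidate]) -> List[Tuple[str, float]]:
    """Steps 2 and 3: apply the model to each candidate and return ranked (id, score) pairs."""
    scored = [(c.entity_id, model.predict(c.features)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Whether the ranked results are persisted or returned directly to the caller is what distinguishes the offline, nearline, and online variants described in the following sections.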
When members interact with edge-building entry points (e.g., MyNetwork), the system processes their requests in real time. In the early stages of the system's development, most online services within the edge-building system placed top priority on responsiveness. To achieve this, inference results were often pre-generated for quick retrieval. However, two key limitations became increasingly apparent: result staleness and high computational cost. Pre-computed scores could quickly become outdated, reducing the relevance and effectiveness of the recommendations. At the same time, significant resources were spent generating and storing recommendations for members who might never request them. As these trade-offs grew more pronounced, they motivated the system’s evolution toward more dynamic and efficient inference approaches.
In practice, selecting an inference approach often involves balancing factors such as latency, freshness, and cost, since it's rare for a single method to optimize for all factors simultaneously. Depending on product needs, different systems may adopt different inference strategies. We’ll use LinkedIn’s edge-building online system as an example to illustrate how model inference is hosted across various environments in the following sections.
Offline scoring
In the early stages of edge-building design, model inference was executed offline, driven primarily by latency considerations. In this approach, scores were pre-calculated during an offline process and stored in key-value databases, optimized for efficient lookups by online services as needed.
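As a rough illustration of this pattern, the sketch below pre-computes top-K results for every member in a batch job and writes them to a key-value store (mocked here with a dict), so that the online path is reduced to a simple lookup; the function names and the score_fn callable are hypothetical.

```python
# Minimal sketch of offline (batch) scoring: pre-compute and persist scores,
# then serve them with a cheap key-value lookup. A dict stands in for the
# key-value database used in production.
from typing import Callable, Dict, Iterable, List, Tuple

ScoredList = List[Tuple[str, float]]  # (entity_id, score) pairs, highest first


def offline_scoring_job(score_fn: Callable[[str], ScoredList],
                        all_member_ids: Iterable[str],
                        kv_store: Dict[str, ScoredList],
                        top_k: int = 50) -> None:
    """Batch job: run candidate generation and model inference for every member."""
    for member_id in all_member_ids:
        ranked = score_fn(member_id)          # model inference happens offline
        kv_store[member_id] = ranked[:top_k]  # persist results for fast retrieval


def serve_request(member_id: str, kv_store: Dict[str, ScoredList]) -> ScoredList:
    """Online path: no inference at request time, just a key lookup."""
    return kv_store.get(member_id, [])
```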
One advantage of offline scoring is that it shifts the computational load away from online services, allowing them to retrieve pre-computed scores quickly and maintain strong performance. However, the pre-computation process could consume massive offline resources, making it highly inefficient and costly.
For instance, in our Follows system's early inference flow - prior to the transition to online scoring - offline storage requirements were more than 90% higher than they are with online scoring. This was because large amounts of compute and storage were spent generating recommendations for members who did not log in on a given day. Additionally, even with up-to-date feature data, pre-calculated scores could become outdated before the next offline computation cycle, which might take days to complete. This delay resulted in a suboptimal member experience.
The design choice ultimately depends on the requirements. Offline scoring can be a viable solution for use cases where service performance is critical but data freshness is less of a priority, or when offline computation costs are relatively low. In such scenarios, it reduces the load on online services while maintaining acceptable performance levels, making it a suitable choice for specific use cases.
Nearline scoring
Model inference can also be performed in a nearline environment. For example, in stream processing pipelines, when a message is sent to deliver newly generated recommendations, real-time model inference can be executed immediately. Similarly, when a new feature becomes available, a message can be initiated to trigger model inference and ensure the latest scores are computed.
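As a simplified sketch, assuming an abstract in-process event queue rather than our actual stream-processing stack, each update event triggers a fresh round of scoring and an update of the stored results:

```python
# Minimal sketch of event-triggered (nearline) scoring. In production the events
# would come from a stream-processing pipeline; a queue stands in for it here.
import queue
from typing import Callable, Dict, List, Tuple

ScoredList = List[Tuple[str, float]]  # (entity_id, score) pairs, highest first


def nearline_rescoring_loop(events: "queue.Queue[dict]",
                            score_fn: Callable[[str], ScoredList],
                            kv_store: Dict[str, ScoredList],
                            top_k: int = 50) -> None:
    """Re-score a member's candidates whenever a feature or recommendation event arrives."""
    while True:
        event = events.get()      # blocks until the next update event
        if event is None:         # sentinel used to stop the loop
            break
        member_id = event["member_id"]
        kv_store[member_id] = score_fn(member_id)[:top_k]  # refresh scores on each event
        events.task_done()
```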
Nearline scoring strikes a balance between the efficiency of offline scoring and the immediacy of online scoring. By triggering inference in response to events, nearline scoring can provide fresher scores without overloading the online services with constant real-time processing. This leads to more timely updates compared to traditional offline methods, and it can also enable systems to adapt quickly to changes in user behavior or newly available features. That being said, managing nearline scoring presents its own challenges, particularly in handling processing delays. When the queue of update events grows long, scoring falls behind and the stored results become stale.
Our Catch Up system was an example that implemented nearline scoring. However, in its original design, the potential of nearline scoring to address score staleness was underutilized. That’s because unseen candidates were not rescored in response to feature updates or time decay effects, leaving considerable room for improvement. Given the high volume of data traffic and associated rescoring costs, we explored more scalable alternatives. Still, if the feature pipeline is enhanced to support real-time updates and trigger rescoring, and the database is scaled for real-time writes, nearline scoring can be an effective solution to consider, particularly in scenarios where feature changes occur over longer intervals rather than within minutes.
Online scoring
Online scoring performs model inference in real time and is triggered whenever an online service receives a request. This allows scores to be calculated instantly. It provides fresh and relevant results - especially when feature data is up-to-date - and reduces the need for extensive database storage. Scores no longer need to be pre-computed and persisted for every user, allowing the system to serve a broader set of candidates beyond those generated offline. The trade-off is increased latency and potential strain on system resources. As a result, additional optimization is needed to get the best performance out of the system.
To improve recommendation quality and enhance member experience, online model inference has been increasingly adopted in the edge-building system. We implemented several optimizations to maintain performance, including caching feature data and downstream responses, enabling parallel processing, and applying early-stage ranking (e.g., L1 ranking) to narrow down the candidate set before final scoring.
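The sketch below illustrates how those optimizations fit together in a single request path; the cache, the L1 heuristic, and the thread pool are simplified stand-ins for the real components.

```python
# Minimal sketch of request-time (online) scoring with a feature cache, an
# early-stage (L1) ranking step, and parallel feature fetching. All names,
# features, and thresholds are illustrative.
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache
from typing import Callable, Dict, List, Tuple


@lru_cache(maxsize=100_000)
def fetch_features(entity_id: str) -> Tuple[Tuple[str, float], ...]:
    """Feature-store lookup, cached to keep request latency low (placeholder values)."""
    return (("connection_count", 1.0), ("profile_views_7d", 2.0))


def l1_rank(candidate_ids: List[str], budget: int = 200) -> List[str]:
    """Cheap early-stage ranking that narrows the candidate set before final scoring."""
    return candidate_ids[:budget]  # a real L1 ranker would use a lightweight model


def handle_request(candidate_ids: List[str],
                   score_fn: Callable[[Dict[str, float]], float],
                   top_k: int = 20) -> List[Tuple[str, float]]:
    """Score candidates at request time and return the top-K recommendations."""
    shortlist = l1_rank(candidate_ids)
    with ThreadPoolExecutor(max_workers=8) as pool:  # fetch features in parallel
        features = dict(zip(shortlist, pool.map(fetch_features, shortlist)))
    scored = [(cid, score_fn(dict(feats))) for cid, feats in features.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
```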
A further refinement of this flow improves feature freshness by leveraging real-time feature updates in the online scoring path.
Adopting online scoring has also enabled us to experiment with advanced online candidate generation techniques such as Embedding Based Retrieval (EBR), which helps surface more relevant and diverse recommendations. For example, when showing Follows recommendations to new members signing up on LinkedIn, we usually do not have any features associated with them in our feature stores, so we rely on contextual signals. With EBR, we can instead use a member’s profile information to generate embeddings and retrieve relevant candidates from a vector store based on embedding similarity. These EBR-generated candidates are then merged with results from offline and other online sources. All candidates are scored online, and the highest-ranking recommendations are returned to the member.
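Here is a simplified sketch of the EBR retrieval step for a new member, assuming a small in-memory index and cosine similarity; a production system would use a trained encoder model and a dedicated vector store.

```python
# Minimal sketch of embedding-based retrieval (EBR): embed a new member's
# profile, then retrieve the most similar candidates from a vector index.
# The hash-based "encoder" and the in-memory index are placeholders.
import numpy as np


def embed_profile(profile_text: str, dim: int = 64) -> np.ndarray:
    """Placeholder embedding; a real system would use a trained encoder."""
    rng = np.random.default_rng(abs(hash(profile_text)) % (2 ** 32))
    vector = rng.normal(size=dim)
    return vector / np.linalg.norm(vector)


def retrieve_similar(query: np.ndarray,
                     index_ids: list,
                     index_matrix: np.ndarray,
                     top_k: int = 100) -> list:
    """Return the top-K candidate ids by cosine similarity to the query embedding."""
    norms = np.linalg.norm(index_matrix, axis=1) * np.linalg.norm(query)
    similarities = index_matrix @ query / np.clip(norms, 1e-12, None)
    best = np.argsort(-similarities)[:top_k]
    return [index_ids[i] for i in best]
```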
Remote scoring
Our local online scoring implementation runs on-premises and requires significant hardware resources, which makes it difficult to scale as model complexity grows. To address this, we migrated the model inference process to a remote/cloud-based solution to improve scalability and reduce operational costs while maintaining the quality of recommendations.
Remote scoring separates candidate generation from model scoring, which offers several benefits (a minimal sketch of the call pattern follows the list below):
- Different scaling requirements: Candidate and feature generation and model scoring have different computational and resource requirements. Candidate and feature generation focus on producing a list of potential candidates from a large dataset, which tends to require more memory and less CPU computation. Model scoring, on the other hand, evaluates candidates based on specific features and assigns scores, which may require less memory but more complex computation on GPUs. By separating these two processes, we can allocate resources more efficiently and optimize each process independently.
- Faster model iteration: Separating candidate generation and model scoring allows for faster model iteration as each process can be updated and improved independently. This means that if a new feature or algorithm is introduced for candidate generation, it can be tested and implemented without affecting the model scoring process. Similarly, if a new scoring method is developed, it can be integrated without impacting the candidate generation process. The independent iteration leads to faster model development and improvement.
- Increased flexibility in model Directed Acyclic Graph (DAG): A DAG represents the dependencies between different tasks in a model pipeline. By separating candidate generation and model scoring, we can create a more flexible DAG that allows for parallel processing and independent optimization of each process. This flexibility enables the development of more complex and efficient model pipelines. It also makes it easier to incorporate new processes or algorithms into the existing pipeline.
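To make the separation concrete, here is a minimal sketch of the call pattern: candidates are generated locally and sent to a remote scoring service over the network. The endpoint URL, payload shape, and response format are assumptions for illustration, not our actual API.

```python
# Minimal sketch of local candidate generation calling a remote scoring service.
# The endpoint, request payload, and response format are hypothetical.
import json
import urllib.request

SCORING_ENDPOINT = "https://scoring.example.internal/v1/score"  # hypothetical URL


def remote_score(member_id: str, candidate_ids: list, timeout_s: float = 0.2) -> list:
    """Send locally generated candidates to the remote model-scoring service."""
    payload = json.dumps({"member_id": member_id, "candidates": candidate_ids}).encode()
    request = urllib.request.Request(
        SCORING_ENDPOINT,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            return json.loads(response.read())["scores"]
    except OSError:
        # Network failures are a real risk for remote scoring (see the challenges
        # below); in practice, fall back to cached scores or a degraded ranking.
        return []
```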
Remote scoring offers numerous benefits but also presents several challenges:
- Latency: Remote scoring can introduce additional latency due to the network communication required between local candidate generation and remote model scoring services. This added step may impact overall response time when serving recommendations.
- Operational complexity: Managing and maintaining a remote/cloud-based solution can be more complex compared to an on-premise setup. It requires robust monitoring, logging, and error-handling mechanisms to ensure reliability and performance.
- Dependency on network stability: The performance of remote scoring is highly dependent on network stability and bandwidth. Any network issues can lead to delays or failures in the scoring process, affecting the quality of recommendations.
Comparison
Now that we've reviewed the different inference techniques used in edge-building systems where an online service processes real-time member requests, let's take a side-by-side look at how they differ in practice:
- Offline scoring: lowest serving latency (a simple key-value lookup at request time), but scores can go stale between computation cycles and pre-computing for every member is costly.
- Nearline scoring: fresher scores than offline because inference is triggered by events, though long event queues can still delay updates and leave results outdated.
- Online scoring: the freshest results and no need to pre-compute and store scores for every member, at the cost of higher request latency and heavier load on serving resources.
- Remote scoring: scales candidate generation and model scoring independently and supports more complex models, but adds network latency, operational complexity, and a dependency on network stability.
Future considerations
The evolution of the edge-building system opens up new opportunities for growth and innovation. As we look ahead, the following strategic areas stand out:
- AI productivity and enhanced operability: Investments in training and retraining pipelines, automation of model benchmarking, and interoperability between offline and online feature systems are crucial. These initiatives aim to enhance developer efficiency and improve the quality and speed of model iterations. Mitigating risks through robust testability, better troubleshooting, and automated monitoring will ensure smoother operations across large-scale recommendation systems.
- Cost optimization and lean system design: Scaling recommendation systems to handle increasing queries per second (QPS) year-over-year costs millions of dollars. Optimizations such as GPU fine-tuning, reduced feature fetches, advanced storage compression techniques, high-level parallelization, and the optimal utilization of memory and processing units (GPU/CPU) play a critical role in managing this growth sustainably. A shift towards ROI-driven hardware planning and consolidation of redundant systems will enable more efficient scaling.
- Adoption of new technologies: Advanced AI techniques, including large language models (LLMs), transformer-based architectures, and embedding-based retrieval (EBR), are redefining recommendation systems. For example, EBR paired with LLMs can be used to augment missing profile data and generate embeddings that encode member activities and interests. This deep embedding-based understanding facilitates enhanced candidate generation and personalization. GPUs further enable real-time inference at scale, powering dynamic recommendations like real-time profile completions and high-quality suggestions on-demand, ensuring scalability and efficiency for billions of interactions.
- Cutting-edge modeling techniques for relevance: Incorporating innovative modeling techniques, such as Model-Agnostic Meta-Learning (MAML) for low-resource learning, sequential models for capturing temporal patterns, and graph neural networks (GNNs) for relationship inference, enhances recommendation quality. Additionally, advanced ranking models leveraging LLMs enable nuanced understanding and prioritization of candidates, leading to more precise and engaging suggestions tailored to diverse member needs.
- Preparation for an agentic future: Platforms like LinkedIn will play a pivotal role in an agentic future, where systems proactively assist users in achieving their goals. This involves evolving from passive recommendations to dynamic, context-aware interactions powered by AI. The focus will shift toward delivering fewer but higher-impact suggestions enriched with insights, empowering users to make strategic decisions and build stronger networks. By leveraging advanced models and real-time capabilities, the system can seamlessly align member aspirations with actionable opportunities.
Final thoughts
With optimizations in the edge-building system, we are now able to run more efficient experiments and have seen improved member engagement through A/B testing. There is no one-size-fits-all solution when it comes to system design, as the best approach depends on the specific requirements. Trade-offs should be carefully considered to achieve the desired balance.
As we continue to evolve our system, our focus remains on helping members discover new opportunities and foster meaningful relationships.
Acknowledgements
This journey requires ongoing effort, and we're grateful to everyone across teams who helped us reach the milestones together:
Andrew Hatch, Angelika Clayton, Bixing Yan, Chen Lin, Chen Zhang, Chiachi Lo, Chinmayee Vaidya, Cindy Liang, Da Xu, Haohua Wan, Jugpreet Singh Talwar, Liang Li, Liyan Fang, Nirav Shingala, Piyush Pattanayak, Pratham Makhni Alag, Rakesh Malladi, Rishav Roy Chowdhury, Steven Tsay, Tammy Kong, Yafei Wang, Yafei Wei, Yan Wang, Young Yoon, and Xukai Wang
Special thanks to Bobby Nakamoto and Netra Malagi for their steady support and strategic guidance that made this possible, and to Benito Leyva for the editorial input that helped improve the quality of this post.