AI/ML Infra Meetup | Building Production Platform for Large-Scale Recommendation Applications

Building Production Platform for Large-Scale Recommendation Applications
Xu Ning
Snap

About Me
● Director of Engineering, ML Platform at Snap
● (prev.) Uber Michelangelo ML Platform, Horovod Project
● (prev.) big data and infrastructure at Uber, Facebook, Akamai, Microsoft Bing

Recommendation application examples
● Search and Ads
● Short Videos
● Feeds

Example architecture of recommendation systems
“Embedding-based Retrieval with Two-Tower Models in Spotlight”, Snap Eng Blog, 6/6/2023
[Figure: recommendation system architecture diagram. Candidate-count labels: 100 millions, 10s of thousands, thousands, hundreds, a pageful. Technique labels: approximate nearest neighbor search (aka “vector search”); two towers, dot product; Wide-and-deep, DeepFM, DCN, DLRM, Transformers; rule-based; list-wise LTR.]
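The “two towers, dot product” and “vector search” labels above can be illustrated with a minimal sketch. It assumes the user tower and the document tower have already produced embeddings of equal length, and it uses an exact top-k scan as a stand-in for approximate nearest neighbor search, which a production system would use to avoid scoring every document. The names `retrieve_top_k`, `user_embedding`, and `doc_embeddings` are hypothetical and the vectors are toy values.

```python
import heapq
from typing import Dict, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_top_k(
    user_emb: Vector, doc_embs: Dict[str, Vector], k: int
) -> List[Tuple[str, float]]:
    """Score every document against the user embedding and keep the k best.

    This is an exact scan; an ANN index would return roughly the same
    documents without touching the full corpus.
    """
    scored = ((doc_id, dot(user_emb, emb)) for doc_id, emb in doc_embs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Toy embeddings (hypothetical values, two dimensions for readability).
user_embedding = [0.5, 1.0]
doc_embeddings = {
    "doc_a": [1.0, 0.0],
    "doc_b": [0.0, 1.0],
    "doc_c": [1.0, 1.0],
    "doc_d": [-1.0, 0.0],
}

top2 = retrieve_top_k(user_embedding, doc_embeddings, k=2)
print(top2)
```

Because the score is a plain dot product, document embeddings can be computed offline and indexed, and only the user embedding needs to be computed at request time.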

Example architecture of recommendation systems
“Machine Learning for Snapchat Ad Ranking”, Snap Eng Blog, 2/11/2022
Multiple ranking paths compete at auction

Example recommendation models
1. “Embedding-based Retrieval with Two-Tower Models in Spotlight”, Snap Eng Blog, 6/6/2023
2. “Machine Learning for Snapchat Ad Ranking”, Snap Eng Blog, 2/11/2022
Light ranking “L1” Heavy ranking “L2”

Unique technical challenges in recommendation systems
● Data intensive
● Large model size and freshness
● High fanout inference

RecSys is data intensive
“Introducing Bento, Snap's ML Platform”, Snap Eng Blog, 1/28/2025
Volume
● DeepSeek V3 was trained on 14.8 trillion tokens (≈ 60 TB)
● A recommendation model at Snap was trained on 1 PB of data (and continues to be incrementally trained over time)
● Typically 1-epoch training to prevent overfitting
Variety
● Types: counter, categorical, ID, ID list, embeddings, sequence (array of objects)
● Aggregation dimensions: by entity, by cohort, by category, etc.
Velocity
● Trillions of events processed per day in feature pipelines
● Event → available for serving in minutes
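To put the volume and velocity figures on this slide side by side, here is a small back-of-the-envelope calculation. It assumes decimal units (1 TB = 10^12 bytes, 1 PB = 10^15 bytes) and takes “trillions of events per day” at its lower bound of one trillion. Both are assumptions made for illustration, not figures from the talk.

```python
TB = 10**12  # assumption: decimal terabyte
PB = 10**15  # assumption: decimal petabyte
SECONDS_PER_DAY = 24 * 60 * 60

# Volume: the slide's ~60 TB LLM corpus versus the 1 PB recommendation dataset.
llm_corpus_bytes = 60 * TB
recsys_corpus_bytes = 1 * PB
ratio = recsys_corpus_bytes / llm_corpus_bytes

# Velocity: one trillion events per day (lower bound of "trillions").
events_per_day = 10**12
events_per_second = events_per_day / SECONDS_PER_DAY

print(f"RecSys dataset is about {ratio:.1f}x the LLM corpus")
print(f"About {events_per_second / 1e6:.1f} million events per second")
```

Under these assumptions the recommendation dataset is roughly 17 times the size of the LLM corpus, and the feature pipelines sustain more than ten million events per second on average.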

Example: Snap’s Robusta real-time feature platform
“Speed Up Feature Engineering for Recommendation Systems”, Snap Eng Blog, 9/29/2022

Model size: “Scaling law” before it became a buzzword popularized by LLMs
Meta’s recommendation model, 2024
Meta’s Llama 3.1 405B LLM, 2024
“Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters”, Lian et al, 2021

Training large RecSys models
“Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters”, Lian et al, 2021
“Monolith: Real Time Recommendation System With Collisionless Embedding Table”, Liu et al, 2022
99% of weights
DeepFM
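The “99% of weights” label can be made concrete with a rough parameter count. This sketch assumes the label refers to the share of parameters held in embedding tables (the slide text itself does not say), and every size below is hypothetical: one billion IDs, 64-dimensional embeddings, and a small dense network on top.

```python
from typing import List


def dense_param_count(layer_sizes: List[int]) -> int:
    """Weights plus biases for a fully connected stack of the given widths."""
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Hypothetical sizes, chosen only to illustrate the proportion.
num_ids = 1_000_000_000  # rows in the embedding table
embedding_dim = 64       # floats per row
dense_layers = [1024, 1024, 512, 256, 1]

embedding_params = num_ids * embedding_dim
dense_params = dense_param_count(dense_layers)
embedding_share = embedding_params / (embedding_params + dense_params)

print(f"embedding parameters: {embedding_params:,}")
print(f"dense parameters:     {dense_params:,}")
print(f"embedding share:      {embedding_share:.4%}")
```

With sizes in this range the dense network is a rounding error next to the embedding table, which is why training systems for these models tend to treat the sparse and dense parts differently.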

How fresh is fresh enough?
“Monolith: Real Time Recommendation System With Collisionless Embedding Table”, Liu et al, 2022

High Fanout Inference
Compiling model for inference: User feature broadcast
● Train: (user_feature, document_feature) → label
● Inference: user_feature, [(document_feature)]
○ Need to broadcast user_feature at model compilation time or in the inference server
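The train/inference mismatch above can be sketched as follows. The model is trained on (user_feature, document_feature) pairs, but a request carries one user_feature and a list of document features, so the serving side has to pair the single user feature with every candidate. The function names are hypothetical, and a dot product stands in for the trained model.

```python
from typing import List, Sequence, Tuple

Feature = Sequence[float]


def score_pair(user_feature: Feature, document_feature: Feature) -> float:
    """Stand-in for the trained model, which scores one (user, document) pair."""
    return sum(u * d for u, d in zip(user_feature, document_feature))


def broadcast_user_feature(
    user_feature: Feature, document_features: List[Feature]
) -> List[Tuple[Feature, Feature]]:
    """Pair the single user feature with every candidate document.

    The same user_feature object is reused in each pair, so the user
    feature is sent once per request rather than once per document.
    """
    return [(user_feature, doc) for doc in document_features]


def score_request(
    user_feature: Feature, document_features: List[Feature]
) -> List[float]:
    """Inference entry point: one user feature, many document features."""
    pairs = broadcast_user_feature(user_feature, document_features)
    return [score_pair(u, d) for u, d in pairs]


user = [1.0, 2.0]
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = score_request(user, docs)
print(scores)
```

Doing the broadcast inside the compiled model or the inference server keeps the request payload proportional to the number of documents, not to documents times user-feature size.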
Document feature fetching
● Each request may need to fetch tens of thousands of document features
○ 1 TB/s read volume
Externalized Embedding serving
● 1 TB model, which cannot fit in memory
● In-memory database / serving parameter server
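One way to picture externalized embedding serving is a table sharded across several in-memory stores, with each lookup batched per shard. The sketch below is an illustration under assumptions, not the design described in the talk: the class name `ShardedEmbeddingStore`, the modulo sharding, and the zero-vector default for unknown IDs are all hypothetical, and each shard is a local dict standing in for a remote server.

```python
from collections import defaultdict
from typing import Dict, List


class ShardedEmbeddingStore:
    """Toy embedding store split across in-process shards."""

    def __init__(self, num_shards: int, dim: int) -> None:
        self.num_shards = num_shards
        self.dim = dim
        self.shards: List[Dict[int, List[float]]] = [
            {} for _ in range(num_shards)
        ]

    def shard_of(self, feature_id: int) -> int:
        """Assumption: simple modulo sharding by ID."""
        return feature_id % self.num_shards

    def put(self, feature_id: int, vector: List[float]) -> None:
        if len(vector) != self.dim:
            raise ValueError("vector has the wrong dimension")
        self.shards[self.shard_of(feature_id)][feature_id] = vector

    def batch_get(self, feature_ids: List[int]) -> Dict[int, List[float]]:
        """Group IDs by shard so each shard is visited once per request."""
        ids_by_shard: Dict[int, List[int]] = defaultdict(list)
        for fid in feature_ids:
            ids_by_shard[self.shard_of(fid)].append(fid)

        result: Dict[int, List[float]] = {}
        for shard_index, ids in ids_by_shard.items():
            shard = self.shards[shard_index]  # a real system would issue one RPC here
            for fid in ids:
                # Assumption: unknown IDs fall back to a zero vector.
                result[fid] = shard.get(fid, [0.0] * self.dim)
        return result


store = ShardedEmbeddingStore(num_shards=4, dim=2)
store.put(7, [0.1, 0.2])
store.put(12, [0.3, 0.4])
looked_up = store.batch_get([7, 12, 99])
print(looked_up)
```

Batching by shard matters at high fanout: a request for tens of thousands of IDs becomes at most one call per shard instead of one call per ID.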

Inference and online feature fetching for RecSys

Closing words
● Recommendation systems pose unique platform technology and operational challenges due to scale and complexity.
● These systems are highly customized, and there is no clear cloud or open-source out-of-the-box (OOTB) solution at scale.
○ Kuaishou Persia (unmaintained), ByteDance Monolith (unmaintained)
○ Very challenging to adopt
● More on how Snap powers its recommendation applications:
https://eng.snap.com/introducing-bento
🍱

Snap ML Platform is hiring!
● Senior Principal Machine Learning Engineer, ML Platform
● Principal Machine Learning Engineer, ML Training Platform
● Principal Machine Learning Engineer, ML Inference Platform
● Principal Software Engineer, Machine Learning Infrastructure
● Manager, Software Engineering, Machine Learning Infrastructure, AI Training Platform
● Manager, Software Engineering, Full Stack
● Machine Learning Engineer, 5+ Years Experience
● Machine Learning Engineer, 3+ Years of Experience
● Staff Machine Learning Engineer, 8+ Years of Experience
● Staff Software Engineer, ML Infrastructure, 9+ Years of Experience
● Software Engineer, ML Infrastructure, 6+ Years of Experience
● Software Engineer, ML Infrastructure, 2+ Years of Experience
https://careers.snap.com/
