Building Production Platform for Large-Scale Recommendation Applications
Xu Ning
Snap
About Me
● Director of Engineering, ML Platform at Snap
● (prev.) Uber Michelangelo ML Platform, Horovod Project
● (prev.) Big data and infrastructure at Uber, Facebook, Akamai, Microsoft Bing
Recommendation application examples
Search and Ads, Short Videos, Feeds
Example architecture of recommendation systems
“Embedding-based Retrieval with Two-Tower Models in Spotlight”, Snap Eng Blog, 6/6/2023
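[Figure: multi-stage funnel that narrows the candidate set]
Candidates per stage: 100 millions → 10s of thousands → thousands → hundreds → a pageful
Techniques shown along the funnel:
● Approximate nearest neighbor search, aka “vector search”
● Two towers, dot product
● Wide-and-deep, DeepFM, DCN, DLRM, Transformers
● Rule-based
● List-wise LTR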
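The narrowing candidate counts above can be illustrated with a small sketch of dot-product retrieval over two-tower embeddings. This is not Snap's implementation: the embeddings are random stand-ins, the function name `top_k_by_dot_product` is hypothetical, and the search is exact brute force rather than a true approximate nearest neighbor index.

```python
import numpy as np

def top_k_by_dot_product(user_vec, doc_matrix, k):
    """Return indices of the k documents with the highest dot-product score."""
    scores = doc_matrix @ user_vec              # one score per document
    k = min(k, scores.shape[0])
    # argpartition selects the top-k without sorting everything,
    # then only those k survivors are sorted by score.
    top = np.argpartition(-scores, k - 1)[:k]
    return top[np.argsort(-scores[top])]

rng = np.random.default_rng(0)
# Stand-in outputs of the document tower and the user tower (assumed shapes).
doc_embeddings = rng.normal(size=(10_000, 32)).astype(np.float32)
user_embedding = rng.normal(size=32).astype(np.float32)

# Retrieval stage: keep 1,000 of 10,000 candidates.
retrieved = top_k_by_dot_product(user_embedding, doc_embeddings, k=1_000)
# A later, heavier stage would re-score these; here we only truncate.
shortlist = retrieved[:100]
```

Because the document embeddings do not depend on the user, they can be computed ahead of time, which is what makes this stage cheap enough to run over a very large corpus.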
Example architecture of recommendation systems
“Machine Learning for Snapchat Ad Ranking”, Snap Eng Blog, 2/11/2022
Multiple ranking paths compete at auction
Example recommendation models
1. “Embedding-based Retrieval with Two-Tower Models in Spotlight”, Snap Eng Blog, 6/6/2023
2. “Machine Learning for Snapchat Ad Ranking”, Snap Eng Blog, 2/11/2022
Light ranking “L1” Heavy ranking “L2”
Unique technical challenges in recommendation systems
● Data intensive
● Large model size and freshness
● High fanout inference
Volume
● DeepSeek V3 was trained on 14.8 trillion tokens (≈60 TB)
● A recommendation model at Snap is trained on 1 PB of data (and continues to be incrementally trained over time)
● Typically 1-epoch training to prevent overfitting
Variety
● Types: counter, categorical, ID, ID list, embeddings, sequence (array of objects)
● Aggregation dimensions: by entity, by cohort, by category, etc.
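As a rough illustration of one feature type (a counter) computed along several aggregation dimensions, the sketch below counts events by entity, by cohort, and by category. The event fields and the helper `count_by` are hypothetical and only stand in for what a real feature pipeline would do at much larger scale.

```python
from collections import Counter

# Hypothetical engagement events; field names are assumptions for illustration.
events = [
    {"user_id": "u1", "cohort": "new_user", "category": "sports"},
    {"user_id": "u1", "cohort": "new_user", "category": "music"},
    {"user_id": "u2", "cohort": "returning", "category": "sports"},
]

def count_by(events, dimension):
    """Counter feature: number of events per value of the given dimension."""
    return Counter(event[dimension] for event in events)

views_by_entity = count_by(events, "user_id")     # by entity
views_by_cohort = count_by(events, "cohort")      # by cohort
views_by_category = count_by(events, "category")  # by category
```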
Velocity
● Trillions of events processed per day in feature pipelines
● Event->available for serving in minutes
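To put the volume and velocity figures in proportion, here is a back-of-the-envelope calculation. It uses only the 60 TB and 1 PB figures above, decimal units, and a single trillion events per day as a lower bound for "trillions".

```python
TB = 10**12
PB = 10**15
SECONDS_PER_DAY = 24 * 60 * 60

llm_corpus_bytes = 60 * TB      # LLM training corpus size quoted above
recsys_corpus_bytes = 1 * PB    # recommendation training data quoted above

# How many times larger the recommendation training data is.
size_ratio = recsys_corpus_bytes / llm_corpus_bytes

# Sustained event rate implied by one trillion events per day.
events_per_second = 10**12 / SECONDS_PER_DAY

print(f"RecSys data is about {size_ratio:.1f}x the LLM corpus")
print(f"1 trillion events/day is about {events_per_second / 1e6:.1f} million events/s")
```

The recommendation data set is roughly 17 times larger, and even the lower bound on event volume implies more than ten million events per second on average.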
RecSys is data intensive
“Introducing Bento, Snap's ML Platform”, Snap Eng Blog, 1/28/2025
Example: Snap’s Robusta real-time feature platform
“Speed Up Feature Engineering for Recommendation Systems”, Snap Eng Blog, 9/29/2022
Model size: “Scaling law” before it became a buzzword popularized by LLMs
Meta’s recommendation model, 2024
Meta’s Llama 3.1 405B LLM, 2024
“Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters”, Lian et al, 2021
Training large RecSys models
“Persia: An Open, Hybrid System Scaling Deep Learning-based Recommenders up to 100 Trillion Parameters”, Lian et al, 2021
“Monolith: Real Time Recommendation System With Collisionless Embedding Table”, Liu et al, 2022
99% of weights
DeepFM
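Since most of the weights in these models sit in embedding tables, how IDs map to rows matters. The sketch below contrasts a fixed-size hashed table, where distinct IDs can share a row, with a table that gives every ID its own row. It only illustrates the idea of a collisionless table; it is not the data structure used by Monolith or Persia, and both class names are hypothetical.

```python
import numpy as np

class HashedEmbeddingTable:
    """Fixed number of rows; distinct IDs may collide on the same row."""
    def __init__(self, num_rows, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.rows = rng.normal(size=(num_rows, dim)).astype(np.float32)

    def row_index(self, feature_id):
        return feature_id % self.rows.shape[0]

    def lookup(self, feature_id):
        return self.rows[self.row_index(feature_id)]

class CollisionlessEmbeddingTable:
    """One row per ID, created on first use; grows with the ID space."""
    def __init__(self, dim, seed=0):
        self.dim = dim
        self._rng = np.random.default_rng(seed)
        self._rows = {}

    def lookup(self, feature_id):
        if feature_id not in self._rows:
            self._rows[feature_id] = self._rng.normal(size=self.dim).astype(np.float32)
        return self._rows[feature_id]

    def __len__(self):
        return len(self._rows)

hashed = HashedEmbeddingTable(num_rows=1_000, dim=8)
exact = CollisionlessEmbeddingTable(dim=8)

# IDs 3 and 1003 share a row in the hashed table but not in the exact one.
collide = hashed.row_index(3) == hashed.row_index(1_003)
distinct = exact.lookup(3) is not exact.lookup(1_003)
```

The trade-off is memory: the collisionless table grows with the number of distinct IDs, which is one reason such tables reach sizes that no longer fit on a single machine.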
How fresh is fresh enough?
“Monolith: Real Time Recommendation System With Collisionless Embedding Table”, Liu et al, 2022
High Fanout Inference
Compiling model for inference: User feature broadcast
● Train: (user_feature, document_feature) →label
● Inference: user_feature, [(document_feature)]
○ Need to broadcast user_feature at model compilation or inference server
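A minimal sketch of the broadcast step, assuming a toy linear scorer in place of the real model: the single user feature vector is expanded across all candidate documents so the model sees the same (user, document) pair layout it was trained on. The function name `score_candidates` and the weights are hypothetical.

```python
import numpy as np

def score_candidates(user_feature, doc_features, weights):
    """Score every (user, document) pair for one user in a single batch."""
    n_docs = doc_features.shape[0]
    # broadcast_to creates a read-only view, so the user feature is not
    # physically copied n_docs times before concatenation.
    user_tiled = np.broadcast_to(user_feature, (n_docs, user_feature.shape[0]))
    pairs = np.concatenate([user_tiled, doc_features], axis=1)
    return pairs @ weights

user = np.array([1.0, 2.0])                      # one user_feature
docs = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]])                    # [(document_feature)]
weights = np.array([1.0, 1.0, 10.0, 100.0])      # stand-in for a trained model

scores = score_candidates(user, docs, weights)   # array([ 13., 103., 113.])
```

Sending the user feature once per request and expanding it next to the model avoids shipping the same bytes once per candidate.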
Document feature fetching
● Each request may need to fetch 10s of thousands of document features
○ 1TB/s read volume
Externalized Embedding serving
● 1 TB model: cannot fit in memory
● In-memory database / serving parameter server
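One way to picture externalized embedding serving is a key-value store split across shards, with lookups grouped so each shard is contacted once per request. The sketch below keeps everything in one process; `ShardedEmbeddingStore`, modulo sharding, and the zero vector for missing keys are all assumptions for illustration, not a description of Snap's system.

```python
from collections import defaultdict
import numpy as np

class ShardedEmbeddingStore:
    """Toy stand-in for a parameter server: one dict per shard."""
    def __init__(self, num_shards, dim):
        self.num_shards = num_shards
        self.dim = dim
        self.shards = [dict() for _ in range(num_shards)]

    def _shard_of(self, key):
        return key % self.num_shards

    def put(self, key, vector):
        self.shards[self._shard_of(key)][key] = np.asarray(vector, dtype=np.float32)

    def batch_get(self, keys):
        # Group keys by shard so each shard would receive a single request.
        by_shard = defaultdict(list)
        for position, key in enumerate(keys):
            by_shard[self._shard_of(key)].append((position, key))
        # Missing keys keep the zero vector (an assumed default).
        out = np.zeros((len(keys), self.dim), dtype=np.float32)
        for shard_id, items in by_shard.items():
            shard = self.shards[shard_id]
            for position, key in items:
                vector = shard.get(key)
                if vector is not None:
                    out[position] = vector
        return out

store = ShardedEmbeddingStore(num_shards=4, dim=2)
store.put(7, [0.5, 1.5])
store.put(12, [2.0, 3.0])
vectors = store.batch_get([12, 99, 7])   # rows follow the order of the keys
```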
“Introducing Bento, Snap's ML Platform”, Snap Eng Blog, 1/28/2025
Inference and online feature fetching for RecSys
“Introducing Bento, Snap's ML Platform”, Snap Eng Blog, 1/28/2025
Closing words
● Recommendation systems pose unique platform technology and operational challenges due to their scale and complexity.
● They are highly customized, and there is no clear out-of-the-box (OOTB) cloud or open-source solution at scale.
○ Kuaishou Persia (unmaintained), ByteDance Monolith (unmaintained)
○ Very challenging to adopt
● More on how Snap powers its recommendation applications:
https://eng.snap.com/introducing-bento
🍱
Snap ML Platform is hiring!
● Senior Principal Machine Learning Engineer, ML Platform
● Principal Machine Learning Engineer, ML Training Platform
● Principal Machine Learning Engineer, ML Inference Platform
● Principal Software Engineer, Machine Learning Infrastructure
● Manager, Software Engineering, Machine Learning Infrastructure, AI Training Platform
● Manager, Software Engineering, Full Stack
● Machine Learning Engineer, 5+ Years Experience
● Machine Learning Engineer, 3+ Years of Experience
● Staff Machine Learning Engineer, 8+ Years of Experience
● Staff Software Engineer, ML Infrastructure, 9+ Years of Experience
● Software Engineer, ML Infrastructure, 6+ Years of Experience
● Software Engineer, ML Infrastructure, 2+ Years of Experience
https://careers.snap.com/

AI/ML Infra Meetup | Building Production Platform for Large-Scale Recommendation Applications
