Unlock the future of Generative AI:
TorchTitan’s latest breakthroughs
Jiani Wang
Software Engineer, PyTorch, Meta
Some slide credits to our team members Tianyu Liu and Wanchao Liang
Foundation Model Training: Before
Usually... a mess
But why?
A story from poolside
Large Scale Training Challenges

01 N-D parallelism needed for large-scale training
Scaling laws: data and models are too big!
Many parallelisms are needed to enable and accelerate model training:
● Data Parallelism: DDP, FSDP
● Model Parallelism: TP/SP, PP
Existing solutions are often:
● Too complicated to understand and compose together
● Model-specific, taking months of engineering work to enable
● Not ideal for research exploration!

02 Training efficiency
● Kernel/compiler-based fusions
● Mixed precision training (e.g. Float8)
● Activation checkpointing
● …

03 Production readiness
● Checkpoint save/load
● Debuggability
● Profiling/TensorBoard/metrics
● Large model initialization
● …

So many things need to work together for efficient LLM training at scale!
TorchTitan: a PyTorch native platform for foundation model training
Composable N-D Parallelism Strategies
● PyTorch native 3D/4D/5D parallelism that
focuses on composability and simplicity
● Decouple model code and infra code (i.e. the N-D parallelism implementation) to allow faster research exploration (see the sketch below)
https://github.com/pytorch/torchtitan
[Image sources: Hugging Face blog, FlashAttention paper, PyTorch blog]
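To make the decoupling concrete, here is a minimal sketch using stock PyTorch distributed APIs (recent PyTorch, roughly 2.6+; the model and the `parallelize` function are illustrative, not torchtitan's actual code): the model is written as plain single-device PyTorch, while a separate infra function lays out a 2-D device mesh, applies TP inside the block, and applies FSDP2 across the data-parallel dimension.

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard  # FSDP2 (import path varies by version)
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

# Model code: plain single-device PyTorch, no parallelism logic at all.
class FeedForward(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden)
        self.w2 = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.w2(self.w1(x).relu())

# Infra code: TP inside the block, FSDP2 across the dp mesh dimension.
# Assumes torch.distributed is already initialized (e.g. via torchrun).
def parallelize(model: nn.Module, dp: int, tp: int) -> nn.Module:
    mesh = init_device_mesh("cuda", (dp, tp), mesh_dim_names=("dp", "tp"))
    parallelize_module(
        model,
        mesh["tp"],
        {"w1": ColwiseParallel(), "w2": RowwiseParallel()},
    )
    fully_shard(model, mesh=mesh["dp"])  # shard parameters over dp ranks
    return model

# Usage: model = parallelize(FeedForward(4096, 16384), dp=8, tp=8)
```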
Efficient Training Techniques
● Computation Fusions: torch.compile,
FlashAttention
● Mixed Precision Training recipe:
torchao.float8, bfloat16
● Communication Overlap: Async TP,
zero-bubble PP
Production Ready Training
● Distributed Checkpoint save/load
● Flight recorder debugging
● Metrics via TensorBoard/W&B, profiling
● Large model meta device initialization (sketch below)
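Meta-device initialization in stock PyTorch looks roughly like this (a minimal sketch; torchtitan's actual flow also interleaves parallelism application and per-model weight init):

```python
import torch
import torch.nn as nn

# Build the model on the meta device: shapes only, no memory allocated,
# so even a very large model "constructs" instantly.
with torch.device("meta"):
    model = nn.Transformer(d_model=8192, num_encoder_layers=64)

# Parallelism (e.g. fully_shard) is typically applied here, so each
# rank only materializes its own shard in the next step.

# Materialize storage on the target device, then initialize weights.
model.to_empty(device="cuda")
for module in model.modules():
    if isinstance(module, nn.Linear):
        nn.init.trunc_normal_(module.weight, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
```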
Composable N-D Parallelism Training

End-to-end model training pipeline
● Built-in LLMs (Llama 3/4, FLUX.1, DeepSeek-V3, Qwen3)
● Efficient data loading solution
● Meta device initialization
● Monitor training progress
All the components built are extensible!

Parallelism strategies
● Fully Sharded Data Parallel v2
● Tensor Parallel/Sequence Parallel
● Pipeline Parallel
● Context Parallel
● Expert Parallel
PyTorch native parallelisms are composable with each other, and with other training techniques!

Distributed Checkpoint
● Parallelism-aware state_dict
● Efficient save/load/resharding
● Async save for performance (see the sketch below)
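Under the hood this builds on PyTorch Distributed Checkpoint (DCP). A minimal sketch of the stock `torch.distributed.checkpoint` APIs, given an already-parallelized `model` (torchtitan wraps these in its own checkpoint manager):

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import (
    get_model_state_dict,
    set_model_state_dict,
)

# Parallelism-aware state_dict: DTensor shards are extracted correctly
# regardless of which FSDP/TP/PP combination produced them.
state = {"model": get_model_state_dict(model)}

# Each rank writes only its own shards. On load, DCP reshards
# automatically if world size or parallelism layout has changed.
dcp.save(state, checkpoint_id="ckpt/step-1000")

# Async save returns a Future and overlaps I/O with training.
save_future = dcp.async_save(state, checkpoint_id="ckpt/step-2000")

# Restore: load in-place into the (possibly resharded) state dict.
dcp.load(state, checkpoint_id="ckpt/step-1000")
set_model_state_dict(model, state["model"])
```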
Scaling with TorchTitan 4D Parallelism
A systematic approach to scaling to thousands of GPUs:
1. Tackle large model/data sizes
2. Stay training-efficient while scaling
3. Long-sequence training
4. Scale to thousands of GPUs with a few knobs
5. Extensible to new model architectures
Optimizing Training Efficiency in TorchTitan

Activation checkpointing (recomputation)

torch.compile on each TransformerBlock (sketch below)
● Fast compile time
● Supports full-graph capture
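A minimal sketch of the per-block pattern, assuming a model whose blocks live in a `layers` ModuleList and using the semi-private `checkpoint_wrapper` utility (helper name `apply_ac_and_compile` is illustrative). Compiling block-by-block keeps compile time low, and since every block has the same structure, each block still compiles as a full graph:

```python
import torch
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    checkpoint_wrapper,
)

def apply_ac_and_compile(model: torch.nn.Module) -> None:
    # Wrap and compile each TransformerBlock individually.
    for name, block in model.layers.named_children():
        block = checkpoint_wrapper(block)            # recompute activations in backward
        block = torch.compile(block, fullgraph=True)  # per-block full-graph capture
        model.layers.register_module(name, block)    # swap in the wrapped block
```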
Mixed precision training recipes (sketch below)
● bfloat16 example
● Float8 example
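For the Float8 recipe, torchtitan builds on torchao. A minimal sketch of both recipes on an existing `model`, assuming a recent torchao and PyTorch with FSDP2 (config option names vary across torchao versions):

```python
import torch
from torch.distributed.fsdp import MixedPrecisionPolicy, fully_shard
from torchao.float8 import Float8LinearConfig, convert_to_float8_training

# Float8 recipe: swap eligible nn.Linear modules for float8 training
# variants (matmuls run in float8 with dynamic scaling). Done before
# sharding so FSDP sees the converted modules.
convert_to_float8_training(
    model,
    config=Float8LinearConfig(enable_fsdp_float8_all_gather=True),
)

# bfloat16 recipe: keep master weights in fp32, run compute in bf16,
# reduce gradients in fp32 via FSDP2's mixed precision policy.
fully_shard(
    model,
    mp_policy=MixedPrecisionPolicy(
        param_dtype=torch.bfloat16, reduce_dtype=torch.float32
    ),
)
```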
Highlights in 2025

01 Extended N-D Parallelism
● Context Parallel
● Expert Parallel

02 Compiler-based Parallelism
● SimpleFSDP
● AutoParallel (new repo!)

03 Fault Tolerance (TorchFT)
● Fault-tolerant HSDP
● Semi-sync training (LocalSGD, DiLoCo)

04 Low Precision (TorchAO)
● MXFP8
● Rowwise FP8
Models and Scaling
N-D parallelism for SotA models in each category:

Dense
● Llama3
● Qwen3

Mixture-of-Experts
● DeepSeek-V3
● Qwen3-MoE
● gpt-oss

Vision-Language
● SigLIP2-Llama3

Diffusion
● FLUX.1
* Thanks to our collaborators/contributors: Phúc H. Lê Khắc (@lkhphuc), Yasser Dahou (@YasserdahouML), Ankit Singh (@Griffintaur),
Antoni-Joan Solergibert (@TJ-Solergibert), Rohan Pandey (@KhoomeiK), Black Forest Labs, MLCommons
Case Study: DeepSeek-V3 Model

Currently applied (see the dispatch sketch below)
● Grouped experts via torch._grouped_mm
● Expert Parallel
● Selective Activation Checkpointing
● NCCL all-to-all for token dispatch/combine
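A simplified sketch of this path. It is heavily simplified: one local expert per incoming chunk, square expert weights, and router-produced split sizes passed in. `torch._grouped_mm` is a private API whose signature may differ across PyTorch versions, so the fused grouped GEMM is shown as its per-expert-loop equivalent:

```python
import torch
import torch.distributed as dist

def moe_dispatch_combine(tokens, expert_weights, send_splits, recv_splits, ep_group):
    # Dispatch: all-to-all routes each token to the EP rank hosting its
    # selected expert; split sizes come from the router.
    recv = tokens.new_empty((sum(recv_splits), tokens.shape[-1]))
    dist.all_to_all_single(
        recv, tokens,
        output_split_sizes=recv_splits,
        input_split_sizes=send_splits,
        group=ep_group,
    )

    # Grouped expert compute: torchtitan fuses these per-expert GEMMs
    # into a single kernel via torch._grouped_mm; the loop below is the
    # semantic equivalent, shown for clarity.
    outs, start = [], 0
    for w, n in zip(expert_weights, recv_splits):
        outs.append(recv[start:start + n] @ w)
        start += n

    # Combine: the reverse all-to-all returns expert outputs to the
    # ranks the tokens came from.
    combined = torch.empty_like(tokens)
    dist.all_to_all_single(
        combined, torch.cat(outs),
        output_split_sizes=send_splits,
        input_split_sizes=recv_splits,
        group=ep_group,
    )
    return combined
```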
Future optimizations
● DeepEP-style communication dedup
● DualPipeV pipeline schedule
Extension Points: From Example to Framework
TrainSpec
● model
● parallelize fn
● training components
○ tokenizer, data loader
○ optimizer, LR scheduler
○ loss fn, validation
● HF conversion maps (illustrative registration sketch below)
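Registering a new model roughly follows this pattern. This is a hypothetical sketch: the dataclass fields and registry below are illustrative reconstructions from the bullet list above, not torchtitan's exact definitions; see the TrainSpec protocol in the repo for the real interface.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class TrainSpec:
    # Field names here are illustrative, not torchtitan's exact dataclass.
    name: str
    model_cls: type
    parallelize_fn: Callable[..., Any]        # applies FSDP/TP/PP/CP/EP
    build_tokenizer_fn: Callable[..., Any]
    build_dataloader_fn: Callable[..., Any]
    build_optimizers_fn: Callable[..., Any]
    build_lr_schedulers_fn: Callable[..., Any]
    build_loss_fn: Callable[..., Any]
    hf_state_dict_adapter: Optional[type] = None  # HF conversion maps

_REGISTRY: dict[str, TrainSpec] = {}

def register_train_spec(spec: TrainSpec) -> None:
    # The trainer looks the spec up by name and drives the entire
    # training pipeline from these components.
    _REGISTRY[spec.name] = spec
```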
ModelConverter
● in-place, runtime model transformations
● quantization / fused implementation
JobConfig
● hierarchical, modularized
● easy extension from base
JobConfig
torchtitan/experiments/
● a middle ground for innovation
○ VLM
○ SimpleFSDP / AutoParallel
○ RL trainer interface
● community contribution is welcome
Ecosystem

Integration with Hugging Face
● adapters between torchtitan <> HF
○ load HF checkpoints
○ publish HF models
○ inference, eval, fine-tune, RL
● Spoiler
○ train HF models natively in torchtitan

Integration with RL frameworks
● torchtitan as the trainer
[Diagram: torchtitan connecting to Hugging Face, torchforge, and vLLM]
Adoptions
Academia
● ICLR 2025 paper
● ICML 2025 ES-FoMo invited talk
● ICML 2025 CODEML workshop
● Numerous papers based on torchtitan
Cloud Platforms
● AWS, Crusoe, DataCrunch, IBM,
SkyPilot, Together AI, etc.
Frontier Labs
● e.g. Nous Research, poolside
Next Steps
Training efficiency
● MoE communication overlapping and dedup
● Muon and second-order optimizers (distributed)
RL foundations
● fundamental capabilities
● framework integration
Compiler-based distributed training
● compiler backend optimizations
● AutoParallel integration
