Unlock the future of Generative AI:
TorchTitan’s latest breakthroughs
Jiani Wang
Software Engineer, PyTorch, Meta
Some slide credits to our team members Tianyu Liu and Wanchao Liang
Foundation Model Training: Before
Usually... a mess
But why?
A story from poolside
Large Scale Training Challenges

01 N-D parallelism needed for large-scale training
Scaling laws: data and models are too big!
Many parallelisms are needed to enable and accelerate model training:
● Data Parallelism: DDP, FSDP
● Model Parallelism: TP/SP, PP
Existing solutions are often:
● Too complicated to understand and compose together
● Model-specific, taking months of engineering work to enable
● Not ideal for research exploration!

02 Training efficiency
● Kernel/compiler-based fusions
● Mixed precision training (e.g. Float8)
● Activation checkpointing
● …

03 Production readiness
● Checkpoint save/load
● Debuggability
● Profiling/TensorBoard/metrics
● Large model initialization
● …

So many things need to work together for efficient LLM training at scale!
TorchTitan: a PyTorch native platform for foundation model training
Composable N-D Parallelism Strategies
● PyTorch native 3D/4D/5D parallelism that
focuses on composability and simplicity
● Decouple model code and infra code (i.e. the N-D parallelism implementation) to allow faster research exploration (see the sketch below)
https://github.com/pytorch/torchtitan
[Image sources: Hugging Face blog, FlashAttention paper, PyTorch blog]
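To make the decoupling concrete, here is a minimal sketch using stock PyTorch distributed APIs (recent PyTorch, roughly 2.6+; the model and the `parallelize` function are illustrative, not torchtitan's actual code): the model is written as plain single-device PyTorch, while a separate infra function lays out a 2-D device mesh, applies TP inside the block, and applies FSDP2 across the data-parallel dimension.

```python
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.fsdp import fully_shard  # FSDP2 (import path varies by version)
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)

# Model code: plain single-device PyTorch, no parallelism logic at all.
class FeedForward(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden)
        self.w2 = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.w2(self.w1(x).relu())

# Infra code: TP inside the block, FSDP2 across the dp mesh dimension.
# Assumes torch.distributed is already initialized (e.g. via torchrun).
def parallelize(model: nn.Module, dp: int, tp: int) -> nn.Module:
    mesh = init_device_mesh("cuda", (dp, tp), mesh_dim_names=("dp", "tp"))
    parallelize_module(
        model,
        mesh["tp"],
        {"w1": ColwiseParallel(), "w2": RowwiseParallel()},
    )
    fully_shard(model, mesh=mesh["dp"])  # shard parameters over dp ranks
    return model

# Usage: model = parallelize(FeedForward(4096, 16384), dp=8, tp=8)
```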
Efficient Training Techniques
● Computation Fusions: torch.compile,
FlashAttention
● Mixed Precision Training recipe:
torchao.float8, bfloat16
● Communication Overlap: Async TP,
zero-bubble PP
Production Ready Training
● Distributed Checkpoint save/load
● Flight recorder debugging
● Metrics via TensorBoard/W&B, profiling
● Large model meta device initialization (sketch below)
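Meta-device initialization in stock PyTorch looks roughly like this (a minimal sketch; torchtitan's actual flow also interleaves parallelism application and per-model weight init):

```python
import torch
import torch.nn as nn

# Build the model on the meta device: shapes only, no memory allocated,
# so even a very large model "constructs" instantly.
with torch.device("meta"):
    model = nn.Transformer(d_model=8192, num_encoder_layers=64)

# Parallelism (e.g. fully_shard) is typically applied here, so each
# rank only materializes its own shard in the next step.

# Materialize storage on the target device, then initialize weights.
model.to_empty(device="cuda")
for module in model.modules():
    if isinstance(module, nn.Linear):
        nn.init.trunc_normal_(module.weight, std=0.02)
        if module.bias is not None:
            nn.init.zeros_(module.bias)
```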
Composable N-D Parallelism Training

End-to-end model training pipeline
● Built-in LLMs (Llama 3/4, FLUX.1, DeepSeek-V3, Qwen3)
● Efficient data loading solution
● Meta device initialization
● Monitor training progress
All the components built are extensible!

Parallelism strategies
● Fully Sharded Data Parallel v2
● Tensor Parallel/Sequence Parallel
● Pipeline Parallel
● Context Parallel
● Expert Parallel
PyTorch native parallelisms are composable with each other, and with other training techniques!

Distributed Checkpoint
● Parallelism-aware state_dict
● Efficient save/load/resharding
● Async save for performance (see the sketch below)
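Under the hood this builds on PyTorch Distributed Checkpoint (DCP). A minimal sketch of the stock `torch.distributed.checkpoint` APIs, given an already-parallelized `model` (torchtitan wraps these in its own checkpoint manager):

```python
import torch.distributed.checkpoint as dcp
from torch.distributed.checkpoint.state_dict import (
    get_model_state_dict,
    set_model_state_dict,
)

# Parallelism-aware state_dict: DTensor shards are extracted correctly
# regardless of which FSDP/TP/PP combination produced them.
state = {"model": get_model_state_dict(model)}

# Each rank writes only its own shards. On load, DCP reshards
# automatically if world size or parallelism layout has changed.
dcp.save(state, checkpoint_id="ckpt/step-1000")

# Async save returns a Future and overlaps I/O with training.
save_future = dcp.async_save(state, checkpoint_id="ckpt/step-2000")

# Restore: load in-place into the (possibly resharded) state dict.
dcp.load(state, checkpoint_id="ckpt/step-1000")
set_model_state_dict(model, state["model"])
```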
Scaling with TorchTitan 4D Parallelism
A systematic approach to scaling to thousands of GPUs:
1. Tackle large model/data sizes
2. Stay training-efficient while scaling
3. Long-sequence training
4. Scale to thousands of GPUs with a few knobs
5. Extensible to new model architectures
Optimizing Training Efficiency in TorchTitan

Activation checkpointing (recomputation)

torch.compile on each TransformerBlock (sketch below)
● Fast compile time
● Supports full-graph capture
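A minimal sketch of the per-block pattern, assuming a model whose blocks live in a `layers` ModuleList and using the semi-private `checkpoint_wrapper` utility (helper name `apply_ac_and_compile` is illustrative). Compiling block-by-block keeps compile time low, and since every block has the same structure, each block still compiles as a full graph:

```python
import torch
from torch.distributed.algorithms._checkpoint.checkpoint_wrapper import (
    checkpoint_wrapper,
)

def apply_ac_and_compile(model: torch.nn.Module) -> None:
    # Wrap and compile each TransformerBlock individually.
    for name, block in model.layers.named_children():
        block = checkpoint_wrapper(block)            # recompute activations in backward
        block = torch.compile(block, fullgraph=True)  # per-block full-graph capture
        model.layers.register_module(name, block)    # swap in the wrapped block
```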
Mixed precision training recipes (sketch below)
● bfloat16 example
● Float8 example
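For the Float8 recipe, torchtitan builds on torchao. A minimal sketch of both recipes on an existing `model`, assuming a recent torchao and PyTorch with FSDP2 (config option names vary across torchao versions):

```python
import torch
from torch.distributed.fsdp import MixedPrecisionPolicy, fully_shard
from torchao.float8 import Float8LinearConfig, convert_to_float8_training

# Float8 recipe: swap eligible nn.Linear modules for float8 training
# variants (matmuls run in float8 with dynamic scaling). Done before
# sharding so FSDP sees the converted modules.
convert_to_float8_training(
    model,
    config=Float8LinearConfig(enable_fsdp_float8_all_gather=True),
)

# bfloat16 recipe: keep master weights in fp32, run compute in bf16,
# reduce gradients in fp32 via FSDP2's mixed precision policy.
fully_shard(
    model,
    mp_policy=MixedPrecisionPolicy(
        param_dtype=torch.bfloat16, reduce_dtype=torch.float32
    ),
)
```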
Highlights in 2025

01 Extended N-D Parallelism
● Context Parallel
● Expert Parallel

02 Compiler-based Parallelism
● SimpleFSDP
● AutoParallel (new repo!)

03 Fault Tolerance (TorchFT)
● Fault-tolerant HSDP
● Semi-sync training (LocalSGD, DiLoCo)

04 Low Precision (TorchAO)
● MXFP8
● Rowwise FP8
Models and Scaling
N-D parallelism for SotA models in each category:

Dense
● Llama3
● Qwen3

Mixture-of-Experts
● DeepSeek-V3
● Qwen3-MoE
● gpt-oss

Vision-Language
● SigLIP2-Llama3

Diffusion
● FLUX.1
* Thanks to our collaborators/contributors: Phúc H. Lê Khắc (@lkhphuc), Yasser Dahou (@YasserdahouML), Ankit Singh (@Griffintaur),
Antoni-Joan Solergibert (@TJ-Solergibert), Rohan Pandey (@KhoomeiK), Black Forest Labs, MLCommons
Case Study: DeepSeek-V3 Model

Currently applied (see the dispatch sketch below)
● Grouped experts via torch._grouped_mm
● Expert Parallel
● Selective Activation Checkpointing
● NCCL all-to-all for token dispatch/combine
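A simplified sketch of this path. It is heavily simplified: one local expert per incoming chunk, square expert weights, and router-produced split sizes passed in. `torch._grouped_mm` is a private API whose signature may differ across PyTorch versions, so the fused grouped GEMM is shown as its per-expert-loop equivalent:

```python
import torch
import torch.distributed as dist

def moe_dispatch_combine(tokens, expert_weights, send_splits, recv_splits, ep_group):
    # Dispatch: all-to-all routes each token to the EP rank hosting its
    # selected expert; split sizes come from the router.
    recv = tokens.new_empty((sum(recv_splits), tokens.shape[-1]))
    dist.all_to_all_single(
        recv, tokens,
        output_split_sizes=recv_splits,
        input_split_sizes=send_splits,
        group=ep_group,
    )

    # Grouped expert compute: torchtitan fuses these per-expert GEMMs
    # into a single kernel via torch._grouped_mm; the loop below is the
    # semantic equivalent, shown for clarity.
    outs, start = [], 0
    for w, n in zip(expert_weights, recv_splits):
        outs.append(recv[start:start + n] @ w)
        start += n

    # Combine: the reverse all-to-all returns expert outputs to the
    # ranks the tokens came from.
    combined = torch.empty_like(tokens)
    dist.all_to_all_single(
        combined, torch.cat(outs),
        output_split_sizes=send_splits,
        input_split_sizes=recv_splits,
        group=ep_group,
    )
    return combined
```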
Future optimizations
● DeepEP-style communication dedup
● DualPipeV pipeline schedule
Extension Points: From Example to Framework
TrainSpec
● model
● parallelize fn
● training components
○ tokenizer, data loader
○ optimizer, LR scheduler
○ loss fn, validation
● HF conversion maps (illustrative registration sketch below)
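Registering a new model roughly follows this pattern. This is a hypothetical sketch: the dataclass fields and registry below are illustrative reconstructions from the bullet list above, not torchtitan's exact definitions; see the TrainSpec protocol in the repo for the real interface.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional

@dataclass
class TrainSpec:
    # Field names here are illustrative, not torchtitan's exact dataclass.
    name: str
    model_cls: type
    parallelize_fn: Callable[..., Any]        # applies FSDP/TP/PP/CP/EP
    build_tokenizer_fn: Callable[..., Any]
    build_dataloader_fn: Callable[..., Any]
    build_optimizers_fn: Callable[..., Any]
    build_lr_schedulers_fn: Callable[..., Any]
    build_loss_fn: Callable[..., Any]
    hf_state_dict_adapter: Optional[type] = None  # HF conversion maps

_REGISTRY: dict[str, TrainSpec] = {}

def register_train_spec(spec: TrainSpec) -> None:
    # The trainer looks the spec up by name and drives the entire
    # training pipeline from these components.
    _REGISTRY[spec.name] = spec
```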
ModelConverter
● in-place, runtime model transformations
● quantization / fused implementation
JobConfig
● hierarchical, modularized
● easy extension from base
JobConfig
torchtitan/experiments/
● a middle ground for innovation
○ VLM
○ SimpleFSDP / AutoParallel
○ RL trainer interface
● community contribution is welcome
Ecosystem

Integration with Hugging Face
● adapters between torchtitan <> HF
○ load HF checkpoints
○ publish HF models
○ inference, eval, fine-tune, RL
● Spoiler
○ train HF models natively in torchtitan

Integration with RL frameworks
● torchtitan as the trainer
[Diagram: torchtitan connecting to Hugging Face, torchforge, and vLLM]
Adoptions
Academia
● ICLR 2025 paper
● ICML 2025 ES-FoMo invited talk
● ICML 2025 CODEML workshop
● Numerous papers based on torchtitan
Cloud Platforms
● AWS, Crusoe, DataCrunch, IBM,
SkyPilot, Together AI, etc.
Frontier Labs
● e.g. Nous Research, poolside
Next Steps
Training efficiency
● MoE communication overlapping and dedup
● Muon and second-order optimizers (distributed)
RL foundations
● fundamental capabilities
● framework integration
Compiler-based distributed training
● compiler backend optimizations
● AutoParallel integration
