LARGE SCALE DISTRIBUTED DEEP
LEARNING
MEHDI SHIBAHARA
@MehdiShibahara
DISTRIBUTING CALCULATIONS
PART 1
ABOUT DISTRIBUTED DEEP LEARNING
“SCALE MATTERS”
▸ Training on large datasets can take days or weeks on 1 node with 1 GPU
NVIDIA DGX-1:
8 VOLTA V100 GPUS

960 TFLOPS 😱

$149,000 💸
ABOUT DISTRIBUTED DEEP LEARNING
SOLUTION: DISTRIBUTE ON MULTIPLE NODES AND/OR MULTIPLE GPUS
▸ Model parallelism VS Data parallelism (or both at the same time)
Source: http://engineering.skymind.io/distributed-deep-learning-part-1-an-introduction-to-distributed-training-of-neural-networks
ABOUT DISTRIBUTED DEEP LEARNING
MODEL PARALLELISM: ALEXNET TRAINED ON 2 GPUS
“ImageNet Classification with Deep Convolutional Neural Networks” (2012)
ABOUT DISTRIBUTED DEEP LEARNING
DATA PARALLELISM: RECENTLY PREFERRED
▸ Forward and backward propagation on same model but with different data
▸ Average parameters (weights/biases) OR updates (gradients) at every iteration
▸ Synchronous OR asynchronous, centralized OR decentralized
Source: http://engineering.skymind.io/distributed-deep-learning-part-1-an-introduction-to-distributed-training-of-neural-networks
EXAMPLE: CENTRALIZED SYNCHRONOUS STOCHASTIC GRADIENT DESCENT WITH PARAMETER AVERAGING (sketched below)
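As a rough illustration of this scheme, here is a toy single-process NumPy simulation (synthetic linear-regression shards standing in for the workers' data; not any framework's API): each worker takes one local SGD step from the shared weights, and a central "parameter server" averages the resulting parameters.

    import numpy as np

    # Toy simulation of centralized synchronous SGD with parameter averaging.
    rng = np.random.RandomState(0)
    true_w = rng.randn(10)
    # Four workers, each holding a shard of a synthetic regression dataset.
    shards = [(X, X @ true_w + 0.01 * rng.randn(256))
              for X in (rng.randn(256, 10) for _ in range(4))]

    w = np.zeros(10)
    lr = 0.1
    for _ in range(100):
        local_params = []
        for X, y in shards:                            # in a real system this loop runs in parallel
            grad = 2.0 / len(X) * X.T @ (X @ w - y)    # gradient of the shard's mean squared error
            local_params.append(w - lr * grad)         # local SGD step from the shared weights
        w = np.mean(local_params, axis=0)              # central step: average and broadcast
    # w now approximates true_w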
ABOUT DISTRIBUTED DEEP LEARNING
NO FREE LUNCH
▸ Perfect linear scaling of performance with the number of workers is impossible
▸ Common problems: communication overhead (synchronous), stale gradients
(asynchronous)
Source: http://engineering.skymind.io/distributed-deep-learning-part-1-an-introduction-to-distributed-training-of-neural-networks
ABOUT DISTRIBUTED DEEP LEARNING
FRAMEWORKS IMPLEMENTATION
▸ TensorFlow:
▸ Basic support for multi-GPU on a single node.
▸ Tutorials recommend synchronous SGD with gradients averaged on CPU.
▸ Recently added support for multi-node distributed computation, but the library is
not yet complete (as of v1.2).
▸ Supports data parallelism, in-graph or between-graph replication (?), asynchronous
or synchronous SGD.
▸ Communication is based on RPC between masters and workers.
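A rough sketch of the single-node "tower" pattern from those TF 1.x tutorials (the quadratic loss and random inputs are placeholders of mine, not the tutorial's model): each GPU computes gradients for its own shard, and they are averaged on the CPU, where the parameters live.

    import tensorflow as tf

    num_gpus = 4
    opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)

    with tf.device('/cpu:0'):
        w = tf.Variable(tf.random_normal([20, 10]))            # parameters kept on the CPU

    tower_grads = []
    for i in range(num_gpus):
        with tf.device('/gpu:%d' % i):
            x = tf.random_normal([32, 20])                      # stand-in for this tower's input shard
            loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))   # placeholder loss for the replica
            tower_grads.append(opt.compute_gradients(loss, var_list=[w]))

    with tf.device('/cpu:0'):
        averaged = []
        for grads_and_vars in zip(*tower_grads):                # same variable across all towers
            grads = tf.stack([g for g, _ in grads_and_vars])
            averaged.append((tf.reduce_mean(grads, axis=0), grads_and_vars[0][1]))
        train_op = opt.apply_gradients(averaged)                # one synchronous update per step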
ABOUT DISTRIBUTED DEEP LEARNING
FRAMEWORKS IMPLEMENTATION
▸ Caffe2:
▸ Supports multi-GPU and multi-node distributed calculation.
▸ Communication uses Facebook’s Gloo (between nodes) and NVIDIA NCCL
(between GPUs).
▸ Offers an API for data-parallel training with synchronous SGD
ABOUT DISTRIBUTED DEEP LEARNING
FRAMEWORKS IMPLEMENTATION
▸ MXNet:
▸ Supports multi-CPU, multi-GPU, and multi-node execution for both data and model
parallelism, with both synchronous and asynchronous updates.
▸ Supports variable workloads when GPUs have different specs.
▸ Model update can be either centralized (on CPU) or on device (on main
GPU).
▸ Multi-node jobs can be launched via ssh, MPI, SGE, and YARN.
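A rough sketch of how this looks with MXNet's key-value store (the tiny network and synthetic data are placeholders to keep the example self-contained): the store type selects the update scheme, e.g. 'device' aggregates on the main GPU, while 'dist_sync' / 'dist_async' give synchronous / asynchronous multi-node updates.

    import numpy as np
    import mxnet as mx

    # Distributed jobs are typically launched with MXNet's launcher over ssh/MPI/SGE/YARN.
    kv = mx.kvstore.create('dist_sync')

    # Tiny placeholder network and synthetic data, just to make the sketch runnable.
    net = mx.sym.SoftmaxOutput(
        mx.sym.FullyConnected(mx.sym.Variable('data'), num_hidden=10), name='softmax')
    train_iter = mx.io.NDArrayIter(np.random.rand(1000, 20).astype('float32'),
                                   np.random.randint(0, 10, 1000).astype('float32'),
                                   batch_size=50)

    # Listing several GPU contexts splits each batch across them (data parallelism).
    model = mx.mod.Module(symbol=net, context=[mx.gpu(i) for i in range(4)])
    model.fit(train_iter, kvstore=kv, optimizer='sgd',
              optimizer_params={'learning_rate': 0.1}, num_epoch=10)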
ABOUT DISTRIBUTED DEEP LEARNING
FRAMEWORKS IMPLEMENTATION
▸ CNTK:
▸ Supports multi-GPU and multi-node calculation.
▸ Provides 4 types of SGD algorithms: synchronous data parallel SGD,
asynchronous data parallel SGD, block momentum SGD, and model averaging
SGD.
▸ Communication is based on MPI.
▸ Also offers an implementation of the 1-bit SGD algorithm, which quantizes
gradients to reduce the amount of transferred data.
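CNTK's actual 1-bit SGD code is not reproduced here; the following is only a conceptual NumPy sketch of the underlying idea, 1-bit quantization with error feedback (only the sign of each gradient value plus one shared scale is transmitted, and the quantization error is carried over to the next iteration).

    import numpy as np

    def one_bit_quantize(grad, error_feedback):
        # Conceptual sketch of 1-bit gradient quantization with error feedback.
        g = grad + error_feedback            # re-inject the residual left over from last step
        sign = np.sign(g)                    # 1 bit per value: only the sign is sent
        scale = np.mean(np.abs(g))           # one shared scale, sent alongside the bits
        quantized = sign * scale             # what the receiver reconstructs
        error_feedback = g - quantized       # remember what was lost, for the next call
        return quantized, error_feedback

For example, quantized, err = one_bit_quantize(np.random.randn(1000), np.zeros(1000)) returns the compressed gradient plus the residual to feed into the next iteration.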
ABOUT DISTRIBUTED DEEP LEARNING
FRAMEWORKS IMPLEMENTATION
▸ Chainer:
▸ API for both model and data parallel computation (single node multiple
GPUs). Model update happens on device (main GPU).
▸ Multi-node support is offered through the ChainerMN package (requires
CUDA-aware MPI such as Open MPI or MVAPICH, and NVIDIA NCCL).
▸ ChainerMN implements data parallelism with synchronous SGD (all-reduce
to average gradients after every backprop iteration).
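A minimal ChainerMN setup sketch (the tiny classifier is a placeholder model; the script is launched with mpiexec, one MPI process per GPU):

    import chainer
    import chainer.links as L
    import chainermn

    comm = chainermn.create_communicator('hierarchical')   # NCCL inside a node, MPI across nodes
    device = comm.intra_rank                                # pick this process's GPU

    model = L.Classifier(L.Linear(None, 10))                # tiny placeholder model
    model.to_gpu(device)

    # The multi-node optimizer all-reduces gradients across workers after every backprop.
    optimizer = chainermn.create_multi_node_optimizer(chainer.optimizers.SGD(lr=0.01), comm)
    optimizer.setup(model)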
ABOUT DISTRIBUTED DEEP LEARNING
FRAMEWORKS IMPLEMENTATION
▸ PyTorch:
▸ Supports data-parallel calculation on a single node (as of v0.1.12)
▸ Use the nn.DataParallel API to split data among up to 8 GPUs
▸ Multi-node support is coming in later versions.
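A minimal sketch of that API (written in current PyTorch style; the small sequential model is a placeholder and 'loader' stands in for a user-defined DataLoader): nn.DataParallel splits each mini-batch across the listed GPUs, runs a model replica on every chunk, and gathers the outputs and gradients back on device 0.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10)).cuda()
    model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    for inputs, targets in loader:           # 'loader' stands in for a user-defined DataLoader
        inputs, targets = inputs.cuda(), targets.cuda()
        loss = F.cross_entropy(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()                      # per-GPU gradients are summed onto device 0
        optimizer.step()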
LARGE MINI-BATCH TRAINING
PART 2
LARGE SIZE MINI-BATCH TRAINING
RECENT PUBLICATIONS ON LARGE BATCH TRAINING
▸ 15 Sep 2016: “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima”
▸ 24 May 2017: “Train longer, generalize better: closing the generalization gap in large batch training of neural networks”
▸ 8 June 2017: “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour”
ON LARGE-BATCH TRAINING FOR DEEP
LEARNING: GENERALIZATION GAP AND
SHARP MINIMA
Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail
Smelyanskiy, Ping Tak Peter Tang
PAPER #1
ON LARGE-BATCH TRAINING FOR DEEP LEARNING: GENERALIZATION GAP AND SHARP MINIMA
THE “GENERALIZATION GAP”
▸ Models trained with large batch size appear to generalize less well
▸ Happens even when trained without any budget or limits
ON LARGE-BATCH TRAINING FOR DEEP LEARNING: GENERALIZATION GAP AND SHARP MINIMA
HYPOTHESIS
▸ Large batch models converge to sharp minimizers
ON LARGE-BATCH TRAINING FOR DEEP LEARNING: GENERALIZATION GAP AND SHARP MINIMA
CONCLUSIONS
▸ Shows numerical evidence of large-batch methods converging to sharp
minimizers, but no proof
▸ Speculates that sharp minimizers are closer to the starting point, and confirms
that small-batch methods travel further away from it than large-batch ones
▸ Attempts, without success, to overcome the problem with data augmentation,
conservative training, and robust training.
TRAIN LONGER, GENERALIZE BETTER:
CLOSING THE GENERALIZATION GAP IN LARGE
BATCH TRAINING OF NEURAL NETWORKS
Elad Hoffer, Itay Hubara, Daniel Soudry
PAPER #2
TRAIN LONGER, GENERALIZE BETTER: CLOSING THE GENERALIZATION GAP IN LARGE BATCH
TRAINING OF NEURAL NETWORKS
RANDOM WALK ON RANDOM POTENTIAL PROCESS
▸ Offers different explanation from the “sharp
minima” theory
▸ Describes loss function as a random
potential, and optimization process as a
random walk
▸ Shows empirically that the weight distance
from initialization point increases
logarithmically with the number of
training iterations
Source: https://en.wikipedia.org/wiki/Random_walk
TRAIN LONGER, GENERALIZE BETTER: CLOSING THE GENERALIZATION GAP IN LARGE BATCH
TRAINING OF NEURAL NETWORKS
PROPOSED METHOD
▸ Introduces a rule for matching different mini-batch sizes:
η_L = √(B_L / B_S) · η_S
(the learning rate for a large batch B_L is the small-batch learning rate η_S scaled by the square root of the batch-size ratio)
▸ Increases learning rate with the square root of the mini-batch size
▸ Uses gradient clipping to prevent divergence in first few iterations
▸ Implements Ghost Batch Normalization (uses smaller virtual batches to compute
BN statistics); the scaling rule and Ghost BN are sketched below
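A rough NumPy sketch of these two ingredients (the function names are mine, and the Ghost BN shown is training-mode only, without learned scale/shift or running statistics):

    import numpy as np

    def scaled_lr(base_lr, base_batch, large_batch):
        # "Train longer" rule: grow the learning rate with the square root of the batch-size ratio.
        return base_lr * np.sqrt(large_batch / base_batch)

    def ghost_batch_norm(x, ghost_size, eps=1e-5):
        # Normalize each "virtual" batch of ghost_size samples with its own statistics,
        # instead of using the statistics of the full (large) mini-batch.
        out = np.empty_like(x, dtype=float)
        for start in range(0, x.shape[0], ghost_size):
            chunk = x[start:start + ghost_size]
            mean = chunk.mean(axis=0)
            var = chunk.var(axis=0)
            out[start:start + ghost_size] = (chunk - mean) / np.sqrt(var + eps)
        return out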
TRAIN LONGER, GENERALIZE BETTER: CLOSING THE GENERALIZATION GAP IN LARGE BATCH
TRAINING OF NEURAL NETWORKS
LIMITATIONS
▸ Learning rate scaling and Ghost Batch Normalization show “good
generalization” for large batches
▸ However, small-batch training still requires less computation
“ACCURATE, LARGE MINIBATCH SGD:

TRAINING IMAGENET IN 1 HOUR”
Priya Goyal, Piotr Dollar, Ross Girshick, Pieter Noordhuis, Lukasz
Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He
PAPER #3
“ACCURATE, LARGE MINIBATCH SGD: TRAINING IMAGENET IN 1 HOUR”
CONTRIBUTIONS OF THIS PAPER
▸ Offers a practical guide to accurate large-scale training with synchronous
SGD.
▸ Presents a simple linear scaling rule and evaluates it by training a ResNet on
ImageNet.
▸ Introduces a new warm-up process to avoid instability during first few epochs.
▸ Confirms state of the art results in accuracy in record times for multiple
computer vision tasks (classification, detection, segmentation).
“ACCURATE, LARGE MINIBATCH SGD: TRAINING IMAGENET IN 1 HOUR”
PRACTICAL GUIDE (IF YOU HAVE FB’S RESOURCES…)
▸ Hardware:
▸ 8 P100 GPUs per server, connected by NVLink
▸ Multiple servers (custom Big Basin, open source) connected by 50Gbit Ethernet
▸ Software:
▸ Calculations made with Caffe2
▸ Between-GPU communication handled by NVIDIA NCCL
▸ Between-node communication handled by Gloo (open-sourced by FB)
“ACCURATE, LARGE MINIBATCH SGD: TRAINING IMAGENET IN 1 HOUR”
ALGORITHM: DATA PARALLEL WITH SYNCHRONOUS DECENTRALIZED SGD
▸ Gradient aggregation in parallel with backprop to optimize performance
▸ Possible because every layer in the network can be independently updated
▸ “Regular” SGD (without using quantized gradients or block-momentum)
▸ All-reduce aggregation across nodes uses halving/doubling algorithm (to
optimize latency)
“ACCURATE, LARGE MINIBATCH SGD: TRAINING IMAGENET IN 1 HOUR”
EXAMPLE WITH 4 WORKERS (4 GPUS)
[Figure: per-GPU timeline of the Forward, Backward, Aggregate, and Update phases across the 4 workers]
“ACCURATE, LARGE MINIBATCH SGD: TRAINING IMAGENET IN 1 HOUR”
HALVING/DOUBLING ALGORITHM
[Figure: the four workers' gradient buffers combined by recursive halving (reduce-scatter), then redistributed by recursive doubling (all-gather); a simulation is sketched below]
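The paper relies on Gloo's recursive halving/doubling all-reduce; below is only a single-process NumPy simulation of the idea (assuming a power-of-two worker count), not Gloo's implementation.

    import numpy as np

    def allreduce_halving_doubling(bufs):
        # The "workers" are just NumPy arrays here; a real implementation exchanges the
        # same chunks over the network in 2 * log2(n) steps, which keeps latency low.
        n = len(bufs)                                  # must be a power of two
        size = len(bufs[0])
        data = [np.array(b, dtype=float) for b in bufs]
        lo, hi = [0] * n, [size] * n                   # range each rank is still reducing

        dist = n // 2                                  # recursive halving (reduce-scatter)
        while dist >= 1:
            snapshot = [d.copy() for d in data]
            for r in range(n):
                p = r ^ dist                           # partner rank for this step
                mid = (lo[r] + hi[r]) // 2
                lo[r], hi[r] = (lo[r], mid) if r < p else (mid, hi[r])
                data[r][lo[r]:hi[r]] += snapshot[p][lo[r]:hi[r]]   # reduce the half we keep
            dist //= 2

        dist = 1                                       # recursive doubling (all-gather)
        while dist < n:
            snapshot = [d.copy() for d in data]
            spans = list(zip(lo, hi))
            for r in range(n):
                p = r ^ dist
                data[r][spans[p][0]:spans[p][1]] = snapshot[p][spans[p][0]:spans[p][1]]
                lo[r], hi[r] = min(lo[r], spans[p][0]), max(hi[r], spans[p][1])
            dist *= 2
        return data                                    # every entry now equals sum(bufs)

For example, allreduce_halving_doubling([np.ones(8)] * 4) returns four arrays filled with 4.0.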
“ACCURATE, LARGE MINIBATCH SGD: TRAINING IMAGENET IN 1 HOUR”
LINEAR SCALING RULE
▸ Allows scaling to multiple workers without sacrificing accuracy or
generalization
▸ All other hyper-parameters can be kept unchanged
▸ Gradual warmup phase helps with instability in early stages:
When the mini-batch size is multiplied by k,
multiply the learning rate by k
Linearly increase the learning rate from η to k·η, incrementing at every iteration over the first 5 epochs (see the schedule sketched below)
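A sketch of the resulting schedule (illustrative defaults; the paper's ResNet-50 runs use a base learning rate of 0.1 and k = 32, and keep the usual step decays later in training, which are omitted here):

    def learning_rate(iteration, iters_per_epoch, base_lr=0.1, k=32, warmup_epochs=5):
        # Linear scaling rule with gradual warmup.
        # k = factor by which the mini-batch (number of workers) was scaled up.
        target_lr = k * base_lr
        warmup_iters = warmup_epochs * iters_per_epoch
        if iteration < warmup_iters:
            # ramp linearly from base_lr up to k * base_lr over the first warmup_epochs
            return base_lr + (target_lr - base_lr) * iteration / warmup_iters
        # afterwards: the usual schedule, just with the scaled learning rate
        return target_lr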
“ACCURATE, LARGE MINIBATCH SGD: TRAINING IMAGENET IN 1 HOUR”
INTUITION (IN PARAMETERS SPACE)
[Figure: in parameter space, two consecutive batch-32 gradient steps vs. one batch-64 step, going from START toward the TARGET (local minimum)]
▸ The gradient of a twice-larger batch carries roughly the same information as 2 gradients
of the smaller batch, which allows taking twice-larger “steps” (a higher learning
rate)
“ACCURATE, LARGE MINIBATCH SGD: TRAINING IMAGENET IN 1 HOUR”
SUBTLETIES
▸ Weight decay: if learning rate is absorbed into the gradient tensor, weight
decay needs to be scaled too
▸ Momentum SGD: similarly, a momentum correction needs to be applied when the learning rate changes
▸ Batch normalization: statistics are computed separately for every worker
▸ Aggregation: Normalize update vectors by number of workers so that
aggregation becomes all-reduce summation.
▸ Shuffling: shuffle dataset every epoch and divide among all workers.
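A sketch of the aggregation point in current PyTorch-distributed style (the step() helper and its arguments are mine; torch.distributed is assumed to be initialized): normalizing each worker's loss by the total mini-batch size k·n turns the aggregation into a plain all-reduce summation.

    import torch.nn.functional as F
    import torch.distributed as dist

    def step(model, optimizer, inputs, targets, k, n):
        # k = number of workers, n = per-worker batch size, so k * n = total mini-batch.
        # Dividing the per-worker summed loss by k * n means the gradients only need
        # to be summed across workers, with no extra division after the all-reduce.
        loss = F.cross_entropy(model(inputs), targets, reduction='sum') / (k * n)
        optimizer.zero_grad()
        loss.backward()
        for p in model.parameters():
            if p.grad is not None:
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)   # synchronous aggregation
        optimizer.step()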
“ACCURATE, LARGE MINIBATCH SGD: TRAINING IMAGENET IN 1 HOUR”
EXPERIMENTAL RESULTS
▸ Trained a ResNet-50 model on ImageNet classification task for increasing mini-batch sizes
(i.e. increasing number of workers)
▸ Linear scaling rule verified for mini-batch size up to 8k (=8192 images)
▸ Same result when using ImageNet-5k (5x more images, 6.8 million)
“ACCURATE, LARGE MINIBATCH SGD: TRAINING IMAGENET IN 1 HOUR”
EXPERIMENTAL RESULTS
▸ Large mini-batch SGD is shown to match both the training curves and the
validation error of the small-batch baseline, meaning there are neither
optimization issues nor generalization degradation
“ACCURATE, LARGE MINIBATCH SGD: TRAINING IMAGENET IN 1 HOUR”
RUNTIME CHARACTERISTICS
▸ Time per iteration only increases 12% when batch size increases by 44x
▸ Runtime per epoch decreases from 16 minutes to 30 seconds
▸ Training a ResNet-101 model on ImageNet with 256 Tesla P100 GPUs in only 92.5 minutes
“ACCURATE, LARGE MINIBATCH SGD: TRAINING IMAGENET IN 1 HOUR”
GENERALIZATION
▸ Weights trained with large batch size can be used as pre-trained features for
object detection or segmentation (Mask R-CNN model) with no accuracy
loss
▸ Linear scaling rule was also used to train Mask R-CNN (not pre-training) with
no accuracy loss in the range from 1 to 8 GPUs
“ACCURATE, LARGE MINIBATCH SGD: TRAINING IMAGENET IN 1 HOUR”
COMPARISON
                       Train longer, generalize better     Accurate, Large Minibatch SGD
Learning rate          scaled with √(mini-batch size)      scaled linearly with mini-batch size
Max batch size         4096                                8192
Batch normalization    Ghost BN                            Per-worker BN
Required epochs        Proportional to M                   Constant
CONCLUSION:
GO BIG AND GO FAST
