LARGE SCALE DISTRIBUTED DEEP
LEARNING
MEHDI SHIBAHARA
@MehdiShibahara
DISTRIBUTING CALCULATIONS
PART 1
ABOUT DISTRIBUTED DEEP LEARNING
“SCALE MATTERS”
▸ Training on large datasets can take days or weeks on 1 node with 1 GPU
NVIDIA DGX-1:
8 VOLTA V100 GPUS

960 TFLOPS 😱

$149,000 💸
ABOUT DISTRIBUTED DEEP LEARNING
SOLUTION: DISTRIBUTE ON MULTIPLE NODES AND/OR MULTIPLE GPUS
▸ Model parallelism VS Data parallelism (or both at the same time)
Source: http://engineering.skymind.io/distributed-deep-learning-part-1-an-introduction-to-distributed-training-of-neural-networks
ABOUT DISTRIBUTED DEEP LEARNING
MODEL PARALLELISM: ALEXNET TRAINED ON 2 GPUS
“ImageNet Classification with Deep Convolutional Neural Networks” (2012)
ABOUT DISTRIBUTED DEEP LEARNING
DATA PARALLELISM: RECENTLY PREFERRED
▸ Forward and backward propagation on same model but with different data
▸ Average parameters (weights/biases) OR updates (gradients) at every iteration
▸ Synchronous OR asynchronous, centralized OR decentralized
Source: http://engineering.skymind.io/distributed-deep-learning-part-1-an-introduction-to-distributed-training-of-neural-networks
EXAMPLE: CENTRALIZED SYNCHRONOUS STOCHASTIC GRADIENT DESCENT WITH PARAMETER AVERAGING (sketched below)
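As a rough illustration of this scheme, here is a toy single-process NumPy simulation (synthetic linear-regression shards standing in for the workers' data; not any framework's API): each worker takes one local SGD step from the shared weights, and a central "parameter server" averages the resulting parameters.

    import numpy as np

    # Toy simulation of centralized synchronous SGD with parameter averaging.
    rng = np.random.RandomState(0)
    true_w = rng.randn(10)
    # Four workers, each holding a shard of a synthetic regression dataset.
    shards = [(X, X @ true_w + 0.01 * rng.randn(256))
              for X in (rng.randn(256, 10) for _ in range(4))]

    w = np.zeros(10)
    lr = 0.1
    for _ in range(100):
        local_params = []
        for X, y in shards:                            # in a real system this loop runs in parallel
            grad = 2.0 / len(X) * X.T @ (X @ w - y)    # gradient of the shard's mean squared error
            local_params.append(w - lr * grad)         # local SGD step from the shared weights
        w = np.mean(local_params, axis=0)              # central step: average and broadcast
    # w now approximates true_w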
ABOUT DISTRIBUTED DEEP LEARNING
NO FREE LUNCH
▸ Perfect linear scaling of performance with the number of workers is impossible
▸ Common problems: communication overhead (synchronous), stale gradients
(asynchronous)
Source: http://engineering.skymind.io/distributed-deep-learning-part-1-an-introduction-to-distributed-training-of-neural-networks
ABOUT DISTRIBUTED DEEP LEARNING
FRAMEWORKS IMPLEMENTATION
▸ TensorFlow:
▸ Basic support for multi-GPU on a single node.
▸ Tutorials recommend synchronous SGD with gradients averaged on CPU.
▸ Recently added support for multi-node distributed computation, but the library is
not yet complete (as of v1.2).
▸ Supports data parallelism, in-graph or between-graph replication (?), asynchronous
or synchronous SGD.
▸ Communication is based on RPC between masters and workers.
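A rough sketch of the single-node "tower" pattern from those TF 1.x tutorials (the quadratic loss and random inputs are placeholders of mine, not the tutorial's model): each GPU computes gradients for its own shard, and they are averaged on the CPU, where the parameters live.

    import tensorflow as tf

    num_gpus = 4
    opt = tf.train.GradientDescentOptimizer(learning_rate=0.1)

    with tf.device('/cpu:0'):
        w = tf.Variable(tf.random_normal([20, 10]))            # parameters kept on the CPU

    tower_grads = []
    for i in range(num_gpus):
        with tf.device('/gpu:%d' % i):
            x = tf.random_normal([32, 20])                      # stand-in for this tower's input shard
            loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))   # placeholder loss for the replica
            tower_grads.append(opt.compute_gradients(loss, var_list=[w]))

    with tf.device('/cpu:0'):
        averaged = []
        for grads_and_vars in zip(*tower_grads):                # same variable across all towers
            grads = tf.stack([g for g, _ in grads_and_vars])
            averaged.append((tf.reduce_mean(grads, axis=0), grads_and_vars[0][1]))
        train_op = opt.apply_gradients(averaged)                # one synchronous update per step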
ABOUT DISTRIBUTED DEEP LEARNING
FRAMEWORKS IMPLEMENTATION
▸ Caffe2:
▸ Supports multi-GPU and multi-node distributed calculation.
▸ Communication uses Facebook’s Gloo (between nodes) and NVIDIA NCCL
(between GPUs).
▸ Offers an API for data-parallel training with synchronous SGD
ABOUT DISTRIBUTED DEEP LEARNING
FRAMEWORKS IMPLEMENTATION
▸ MXNet:
▸ Supports multi-CPU, multi-GPU, and multi-node execution for both data and model
parallelism, with both synchronous and asynchronous updates.
▸ Supports variable workloads when GPUs have different specs.
▸ Model update can be either centralized (on CPU) or on device (on main
GPU).
▸ Multi-node jobs can be launched via ssh, MPI, SGE, and YARN.
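A rough sketch of how this looks with MXNet's key-value store (the tiny network and synthetic data are placeholders to keep the example self-contained): the store type selects the update scheme, e.g. 'device' aggregates on the main GPU, while 'dist_sync' / 'dist_async' give synchronous / asynchronous multi-node updates.

    import numpy as np
    import mxnet as mx

    # Distributed jobs are typically launched with MXNet's launcher over ssh/MPI/SGE/YARN.
    kv = mx.kvstore.create('dist_sync')

    # Tiny placeholder network and synthetic data, just to make the sketch runnable.
    net = mx.sym.SoftmaxOutput(
        mx.sym.FullyConnected(mx.sym.Variable('data'), num_hidden=10), name='softmax')
    train_iter = mx.io.NDArrayIter(np.random.rand(1000, 20).astype('float32'),
                                   np.random.randint(0, 10, 1000).astype('float32'),
                                   batch_size=50)

    # Listing several GPU contexts splits each batch across them (data parallelism).
    model = mx.mod.Module(symbol=net, context=[mx.gpu(i) for i in range(4)])
    model.fit(train_iter, kvstore=kv, optimizer='sgd',
              optimizer_params={'learning_rate': 0.1}, num_epoch=10)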
ABOUT DISTRIBUTED DEEP LEARNING
FRAMEWORKS IMPLEMENTATION
▸ CNTK:
▸ Supports multi-GPU and multi-node calculation.
▸ Provides 4 types of SGD algorithms: synchronous data parallel SGD,
asynchronous data parallel SGD, block momentum SGD, and model averaging
SGD.
▸ Communication is based on MPI.
▸ Also offers an implementation of the 1-bit SGD algorithm, which quantizes
gradients to reduce the amount of transferred data.
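CNTK's actual 1-bit SGD code is not reproduced here; the following is only a conceptual NumPy sketch of the underlying idea, 1-bit quantization with error feedback (only the sign of each gradient value plus one shared scale is transmitted, and the quantization error is carried over to the next iteration).

    import numpy as np

    def one_bit_quantize(grad, error_feedback):
        # Conceptual sketch of 1-bit gradient quantization with error feedback.
        g = grad + error_feedback            # re-inject the residual left over from last step
        sign = np.sign(g)                    # 1 bit per value: only the sign is sent
        scale = np.mean(np.abs(g))           # one shared scale, sent alongside the bits
        quantized = sign * scale             # what the receiver reconstructs
        error_feedback = g - quantized       # remember what was lost, for the next call
        return quantized, error_feedback

For example, quantized, err = one_bit_quantize(np.random.randn(1000), np.zeros(1000)) returns the compressed gradient plus the residual to feed into the next iteration.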
ABOUT DISTRIBUTED DEEP LEARNING
FRAMEWORKS IMPLEMENTATION
▸ Chainer:
▸ API for both model and data parallel computation (single node multiple
GPUs). Model update happens on device (main GPU).
▸ Multi-node support is offered through the ChainerMN package (requires
CUDA-aware MPI such as Open MPI or MVAPICH, and NVIDIA NCCL).
▸ ChainerMN implements data parallelism with synchronous SGD (all-reduce
to average gradients after every backprop iteration).
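A minimal ChainerMN setup sketch (the tiny classifier is a placeholder model; the script is launched with mpiexec, one MPI process per GPU):

    import chainer
    import chainer.links as L
    import chainermn

    comm = chainermn.create_communicator('hierarchical')   # NCCL inside a node, MPI across nodes
    device = comm.intra_rank                                # pick this process's GPU

    model = L.Classifier(L.Linear(None, 10))                # tiny placeholder model
    model.to_gpu(device)

    # The multi-node optimizer all-reduces gradients across workers after every backprop.
    optimizer = chainermn.create_multi_node_optimizer(chainer.optimizers.SGD(lr=0.01), comm)
    optimizer.setup(model)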
ABOUT DISTRIBUTED DEEP LEARNING
FRAMEWORKS IMPLEMENTATION
▸ PyTorch:
▸ Supports data-parallel calculation on a single node (as of v0.1.12)
▸ Use the nn.DataParallel API to split data among up to 8 GPUs
▸ Multi-node support is coming in later versions.
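A minimal sketch of that API (written in current PyTorch style; the small sequential model is a placeholder and 'loader' stands in for a user-defined DataLoader): nn.DataParallel splits each mini-batch across the listed GPUs, runs a model replica on every chunk, and gathers the outputs and gradients back on device 0.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 10)).cuda()
    model = nn.DataParallel(model, device_ids=[0, 1, 2, 3])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    for inputs, targets in loader:           # 'loader' stands in for a user-defined DataLoader
        inputs, targets = inputs.cuda(), targets.cuda()
        loss = F.cross_entropy(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()                      # per-GPU gradients are summed onto device 0
        optimizer.step()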
LARGE MINI-BATCH TRAINING
PART 2
LARGE SIZE MINI-BATCH TRAINING
RECENT PUBLICATIONS ON LARGE BATCH TRAINING
▸ 15 Sep 2016: “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima”
▸ 24 May 2017: “Train longer, generalize better: closing the generalization gap in large batch training of neural networks”
▸ 8 June 2017: “Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour”
ON LARGE-BATCH TRAINING FOR DEEP
LEARNING: GENERALIZATION GAP AND
SHARP MINIMA
Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail
Smelyanskiy, Ping Tak Peter Tang
PAPER #1
ON LARGE-BATCH TRAINING FOR DEEP LEARNING: GENERALIZATION GAP AND SHARP MINIMA
THE “GENERALIZATION GAP”
▸ Models trained with large batch size appear to generalize less well
▸ Happens even when trained without any budget or limits
ON LARGE-BATCH TRAINING FOR DEEP LEARNING: GENERALIZATION GAP AND SHARP MINIMA
HYPOTHESIS
▸ Large batch models converge to sharp minimizers
ON LARGE-BATCH TRAINING FOR DEEP LEARNING: GENERALIZATION GAP AND SHARP MINIMA
CONCLUSIONS
▸ Shows numerical evidence of large-batch methods converging to sharp
minimizers, but no proof
▸ Speculates that sharp minimizers are closer to the starting point, and confirms
that small-batch methods travel further away from it than large-batch ones
▸ Attempts, without success, to overcome the problem with data augmentation,
conservative training, and robust training.
TRAIN LONGER, GENERALIZE BETTER:
CLOSING THE GENERALIZATION GAP IN LARGE
BATCH TRAINING OF NEURAL NETWORKS
Elad Hoffer, Itay Hubara, Daniel Soudry
PAPER #2
TRAIN LONGER, GENERALIZE BETTER: CLOSING THE GENERALIZATION GAP IN LARGE BATCH
TRAINING OF NEURAL NETWORKS
RANDOM WALK ON RANDOM POTENTIAL PROCESS
▸ Offers different explanation from the “sharp
minima” theory
▸ Describes loss function as a random
potential, and optimization process as a
random walk
▸ Shows empirically that the weight distance
from initialization point increases
logarithmically with the number of
training iterations
Source: https://en.wikipedia.org/wiki/Random_walk
TRAIN LONGER, GENERALIZE BETTER: CLOSING THE GENERALIZATION GAP IN LARGE BATCH
TRAINING OF NEURAL NETWORKS
PROPOSED METHOD
▸ Introduces a rule for matching different mini-batch sizes:
η_L = √(B_L / B_S) · η_S
(the learning rate for a large batch B_L is the small-batch learning rate η_S scaled by the square root of the batch-size ratio)
▸ Increases learning rate with the square root of the mini-batch size
▸ Uses gradient clipping to prevent divergence in first few iterations
▸ Implements Ghost Batch Normalization (uses smaller virtual batches to compute
BN statistics); the scaling rule and Ghost BN are sketched below
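A rough NumPy sketch of these two ingredients (the function names are mine, and the Ghost BN shown is training-mode only, without learned scale/shift or running statistics):

    import numpy as np

    def scaled_lr(base_lr, base_batch, large_batch):
        # "Train longer" rule: grow the learning rate with the square root of the batch-size ratio.
        return base_lr * np.sqrt(large_batch / base_batch)

    def ghost_batch_norm(x, ghost_size, eps=1e-5):
        # Normalize each "virtual" batch of ghost_size samples with its own statistics,
        # instead of using the statistics of the full (large) mini-batch.
        out = np.empty_like(x, dtype=float)
        for start in range(0, x.shape[0], ghost_size):
            chunk = x[start:start + ghost_size]
            mean = chunk.mean(axis=0)
            var = chunk.var(axis=0)
            out[start:start + ghost_size] = (chunk - mean) / np.sqrt(var + eps)
        return out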
TRAIN LONGER, GENERALIZE BETTER: CLOSING THE GENERALIZATION GAP IN LARGE BATCH
TRAINING OF NEURAL NETWORKS
LIMITATIONS
▸ Learning rate scaling and Ghost Batch Normalization show “good
generalization” for large batches
▸ However, small-batch training still requires less computation
“ACCURATE, LARGE MINIBATCH SGD:

TRAINING IMAGENET IN 1 HOUR”
Priya Goyal, Piotr Dollar, Ross Girshick, Pieter Noordhuis, Lukasz
Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He
PAPER #3
“ACCURATE, LARGE MINIBATCH SGD: TRAINING IMAGENET IN 1 HOUR”
CONTRIBUTIONS OF THIS PAPER
▸ Offers a practical guide to accurate large-scale training with synchronous
SGD.
▸ Presents a simple linear scaling rule and evaluates it by training a ResNet on
ImageNet.
▸ Introduces a new warm-up process to avoid instability during first few epochs.
▸ Confirms state of the art results in accuracy in record times for multiple
computer vision tasks (classification, detection, segmentation).
“ACCURATE, LARGE MINIBATCH SGD: TRAINING IMAGENET IN 1 HOUR”
PRACTICAL GUIDE (IF YOU HAVE FB’S RESOURCES…)
▸ Hardware:
▸ 8 P100 GPUs per server, connected by NVLink
▸ Multiple servers (custom Big Basin, open source) connected by 50Gbit Ethernet
▸ Software:
▸ Calculations made with Caffe2
▸ Between-GPU communication handled by NVIDIA NCCL
▸ Between-node communication handled by Gloo (open-sourced by FB)
“ACCURATE, LARGE MINIBATCH SGD: TRAINING IMAGENET IN 1 HOUR”
ALGORITHM: DATA PARALLEL WITH SYNCHRONOUS DECENTRALIZED SGD
▸ Gradient aggregation in parallel with backprop to optimize performance
▸ Possible because every layer in the network can be independently updated
▸ “Regular” SGD (without using quantized gradients or block-momentum)
▸ All-reduce aggregation across nodes uses halving/doubling algorithm (to
optimize latency)
“ACCURATE, LARGE MINIBATCH SGD: TRAINING IMAGENET IN 1 HOUR”
EXAMPLE WITH 4 WORKERS (4 GPUS)
[Figure: per-GPU timeline of the Forward, Backward, Aggregate, and Update phases across the 4 workers]
“ACCURATE, LARGE MINIBATCH SGD: TRAINING IMAGENET IN 1 HOUR”
HALVING/DOUBLING ALGORITHM
[Figure: the four workers' gradient buffers combined by recursive halving (reduce-scatter), then redistributed by recursive doubling (all-gather); a simulation is sketched below]
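The paper relies on Gloo's recursive halving/doubling all-reduce; below is only a single-process NumPy simulation of the idea (assuming a power-of-two worker count), not Gloo's implementation.

    import numpy as np

    def allreduce_halving_doubling(bufs):
        # The "workers" are just NumPy arrays here; a real implementation exchanges the
        # same chunks over the network in 2 * log2(n) steps, which keeps latency low.
        n = len(bufs)                                  # must be a power of two
        size = len(bufs[0])
        data = [np.array(b, dtype=float) for b in bufs]
        lo, hi = [0] * n, [size] * n                   # range each rank is still reducing

        dist = n // 2                                  # recursive halving (reduce-scatter)
        while dist >= 1:
            snapshot = [d.copy() for d in data]
            for r in range(n):
                p = r ^ dist                           # partner rank for this step
                mid = (lo[r] + hi[r]) // 2
                lo[r], hi[r] = (lo[r], mid) if r < p else (mid, hi[r])
                data[r][lo[r]:hi[r]] += snapshot[p][lo[r]:hi[r]]   # reduce the half we keep
            dist //= 2

        dist = 1                                       # recursive doubling (all-gather)
        while dist < n:
            snapshot = [d.copy() for d in data]
            spans = list(zip(lo, hi))
            for r in range(n):
                p = r ^ dist
                data[r][spans[p][0]:spans[p][1]] = snapshot[p][spans[p][0]:spans[p][1]]
                lo[r], hi[r] = min(lo[r], spans[p][0]), max(hi[r], spans[p][1])
            dist *= 2
        return data                                    # every entry now equals sum(bufs)

For example, allreduce_halving_doubling([np.ones(8)] * 4) returns four arrays filled with 4.0.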
“ACCURATE, LARGE MINIBATCH SGD: TRAINING IMAGENET IN 1 HOUR”
LINEAR SCALING RULE
▸ Allows scaling to multiple workers without sacrificing accuracy or
generalization
▸ All other hyper-parameters can be kept unchanged
▸ Gradual warmup phase helps with instability in early stages:
When the mini-batch size is multiplied by k,
multiply the learning rate by k
Linearly increase the learning rate from η to k·η, incrementing at every iteration over the first 5 epochs (see the schedule sketched below)
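A sketch of the resulting schedule (illustrative defaults; the paper's ResNet-50 runs use a base learning rate of 0.1 and k = 32, and keep the usual step decays later in training, which are omitted here):

    def learning_rate(iteration, iters_per_epoch, base_lr=0.1, k=32, warmup_epochs=5):
        # Linear scaling rule with gradual warmup.
        # k = factor by which the mini-batch (number of workers) was scaled up.
        target_lr = k * base_lr
        warmup_iters = warmup_epochs * iters_per_epoch
        if iteration < warmup_iters:
            # ramp linearly from base_lr up to k * base_lr over the first warmup_epochs
            return base_lr + (target_lr - base_lr) * iteration / warmup_iters
        # afterwards: the usual schedule, just with the scaled learning rate
        return target_lr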
“ACCURATE, LARGE MINIBATCH SGD: TRAINING IMAGENET IN 1 HOUR”
INTUITION (IN PARAMETERS SPACE)
[Figure: in parameter space, two consecutive batch-32 gradient steps vs. one batch-64 step, going from START toward the TARGET (local minimum)]
▸ The gradient of a twice-larger batch carries roughly the same information as 2 gradients
of the smaller batch, which allows taking twice-larger “steps” (a higher learning
rate)
“ACCURATE, LARGE MINIBATCH SGD: TRAINING IMAGENET IN 1 HOUR”
SUBTLETIES
▸ Weight decay: if learning rate is absorbed into the gradient tensor, weight
decay needs to be scaled too
▸ Momentum SGD: similarly, a momentum correction needs to be applied when the learning rate changes
▸ Batch normalization: statistics are computed separately for every worker
▸ Aggregation: Normalize update vectors by number of workers so that
aggregation becomes all-reduce summation.
▸ Shuffling: shuffle dataset every epoch and divide among all workers.
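A sketch of the aggregation point in current PyTorch-distributed style (the step() helper and its arguments are mine; torch.distributed is assumed to be initialized): normalizing each worker's loss by the total mini-batch size k·n turns the aggregation into a plain all-reduce summation.

    import torch.nn.functional as F
    import torch.distributed as dist

    def step(model, optimizer, inputs, targets, k, n):
        # k = number of workers, n = per-worker batch size, so k * n = total mini-batch.
        # Dividing the per-worker summed loss by k * n means the gradients only need
        # to be summed across workers, with no extra division after the all-reduce.
        loss = F.cross_entropy(model(inputs), targets, reduction='sum') / (k * n)
        optimizer.zero_grad()
        loss.backward()
        for p in model.parameters():
            if p.grad is not None:
                dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)   # synchronous aggregation
        optimizer.step()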
“ACCURATE, LARGE MINIBATCH SGD: TRAINING IMAGENET IN 1 HOUR”
EXPERIMENTAL RESULTS
▸ Trained a ResNet-50 model on ImageNet classification task for increasing mini-batch sizes
(i.e. increasing number of workers)
▸ Linear scaling rule verified for mini-batch size up to 8k (=8192 images)
▸ Same result when using ImageNet-5k (5x more images, 6.8 million)
“ACCURATE, LARGE MINIBATCH SGD: TRAINING IMAGENET IN 1 HOUR”
EXPERIMENTAL RESULTS
▸ Large mini-batch SGD is shown to match both the training curves and the
validation error of the small-batch baseline, meaning there are neither
optimization issues nor generalization degradation
“ACCURATE, LARGE MINIBATCH SGD: TRAINING IMAGENET IN 1 HOUR”
RUNTIME CHARACTERISTICS
▸ Time per iteration only increases 12% when batch size increases by 44x
▸ Runtime per epoch decreases from 16 minutes to 30 seconds
▸ Training a ResNet-101 model on ImageNet with 256 Tesla P100 GPUs in only 92.5 minutes
“ACCURATE, LARGE MINIBATCH SGD: TRAINING IMAGENET IN 1 HOUR”
GENERALIZATION
▸ Weights trained with large batch size can be used as pre-trained features for
object detection or segmentation (Mask R-CNN model) with no accuracy
loss
▸ Linear scaling rule was also used to train Mask R-CNN (not pre-training) with
no accuracy loss in the range from 1 to 8 GPUs
“ACCURATE, LARGE MINIBATCH SGD: TRAINING IMAGENET IN 1 HOUR”
COMPARISON
                       Train longer, generalize better     Accurate, Large Minibatch SGD
Learning rate          scaled with √(mini-batch size)      scaled linearly with mini-batch size
Max batch size         4096                                8192
Batch normalization    Ghost BN                            Per-worker BN
Required epochs        Proportional to M                   Constant
CONCLUSION:
GO BIG AND GO FAST
