AI at Scale
COGNITIVE SYSTEMS
Ing. Florin Manaila
Senior Architect and Inventor
Cognitive Systems (Distributed Deep Learning and HPC)
IBM Systems Hardware Europe
Member of the IBM Academy of Technology (AoT)
July 9, 2020
Technical R&D disruption today
[Figure: the path to a new product, evolving from opportunistic discovery by humans, through simulation experiments and simulation & inference experiments (today), to comprehensive discovery by cognitive systems (Cognitive Discovery).]
IBM Cognitive Systems Europe / July 9 / © 2020 IBM Corporation
Next-Generation
Infrastructure Stack
Problem
―
 Datasets are large and growing
 The size of a batch of samples is large and growing
 Sample sizes are large and growing
 More and more sophisticated models are being designed, some with
hundreds of layers
 GPU memory capacity is growing as well (but slower)
 Limited by cost, technology, physical space
 Energy costs are increasing year over year
 Large CO2e footprint per training cycle
What’s in the training of deep neural networks?
Neural network model: billions of parameters (gigabytes)
Computation: iterative gradient-based search, millions of iterations, mainly matrix operations
Data: millions of images or sentences (terabytes)
Workload characteristics: both compute and data intensive!
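A quick back-of-the-envelope check of the "billions of parameters, gigabytes" pairing; the parameter count below is an illustrative assumption, not a figure from this deck:

# Rough memory arithmetic: float32 weights alone, before gradients,
# activations and optimizer state are counted.
params = 1_000_000_000                            # assume a 1-billion-parameter model
bytes_per_param = 4                               # float32
print(params * bytes_per_param / 2**30, "GiB")    # roughly 3.7 GiB just for the weights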
Distributed Deep Learning
Common options
[Figure: four configurations, from longer to shorter training time: single accelerator (1x), data parallel (4x accelerators), model parallel (4x accelerators), and combined data and model parallel (4x n accelerators), with the data replicated or split across systems accordingly.]
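As a rough illustration of the two main styles, here is a hedged sketch in Keras, assuming a 2-GPU host; the layer sizes and device placement are illustrative, not taken from the slides:

import tensorflow as tf
from keras.layers import Input, Dense
from keras.models import Model
from keras.utils import multi_gpu_model

inputs = Input(shape=(1024,))

# Model parallel: different layers of a single replica live on different GPUs.
with tf.device('/gpu:0'):
    hidden = Dense(4096, activation='relu')(inputs)
with tf.device('/gpu:1'):
    outputs = Dense(10, activation='softmax')(hidden)
model_parallel = Model(inputs, outputs)

# Data parallel: the whole model is replicated on every GPU and each batch
# is split into equal chunks, one chunk per replica.
with tf.device('/cpu:0'):
    template = Model(inputs, Dense(10, activation='softmax')(inputs))
data_parallel = multi_gpu_model(template, gpus=2)
data_parallel.compile(loss='categorical_crossentropy', optimizer='rmsprop')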
Data-Parallel Framework
Distributed Learning
[Figure: a large dataset is split into partitions, one per node; each node (Node 0, Node 1) further splits its partition across GPU 0–GPU 3, giving sub-partitions (0,0)…(0,3) and (1,0)…(1,3).]
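The two-level partitioning in the figure can be sketched with tf.data sharding; the node and GPU indices below are illustrative assumptions:

import tensorflow as tf

dataset = tf.data.Dataset.range(1_000_000)               # stands in for the large dataset
num_nodes, node_id = 2, 0                                # this process runs on Node 0
num_gpus, gpu_id = 4, 1                                  # and feeds GPU 1

node_partition = dataset.shard(num_nodes, node_id)       # Partition 0
gpu_partition = node_partition.shard(num_gpus, gpu_id)   # Partition (0,1)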
Scaling
Misperception
AI Frameworks and Multi-GPU
Single GPU utilization
[Figure: IBM AC922 node diagram: two POWER9 CPUs connected by an X-Bus, each with coherent access to 2TB of DDR4 system memory (170GB/s), NVLink 150GB/s links to NVIDIA V100 GPUs, PCIe Gen4 / CAPI 2.0 with a Multi-Host Socket Direct InfiniBand adapter, and PCIe NVMe flash storage; only a single GPU is busy.]
If not told explicitly, AI frameworks make use of a single GPU!
AI Frameworks and Multi-GPU
4x GPU utilization
[Figure: the same AC922 node diagram, now with all four NVIDIA V100 GPUs utilized.]
When used explicitly, a multi-GPU model in an AI framework makes use of all GPUs available on the host, or of the GPUs assigned by SLURM when an interactive session is requested with a specific number of GPUs!
AI Frameworks and Multi-GPU
12x GPU utilization using the collective communication operation called "AllReduce"
[Figure: multiple AC922 nodes (12 GPUs in total) connected through an InfiniBand EDR switch.]
Multi GPU in Keras
Scenarios
 Training models with weights merge on CPU
 Training models with weights merge on CPU using cpu_relocation (recommended for IC922)
 Training models with weights merge on GPU (recommended for AC922)
Issues
 Batch Size
 GPU data starvation aka the CPUs can’t keep up with the GPUs
 Saving your parallel models
 Counting the available GPUs has a nasty side-effect
Issues
GPU data starvation
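GPU data starvation usually means the input pipeline on the CPUs cannot produce batches as fast as the GPUs consume them. A minimal mitigation sketch, assuming a tf.data input pipeline; the dataset, preprocessing and batch size are illustrative:

import numpy as np
import tensorflow as tf

# Dummy in-memory data standing in for a real dataset.
x = np.random.random((1000, 224, 224, 3)).astype('float32')
y = np.random.random((1000, 1000)).astype('float32')

AUTOTUNE = tf.data.experimental.AUTOTUNE

# Run the CPU-side work (augmentation, normalisation) in parallel and prefetch
# batches so the GPUs never sit idle waiting for input.
dataset = (tf.data.Dataset.from_tensor_slices((x, y))
           .shuffle(1000)
           .map(lambda a, b: (a / 255.0, b), num_parallel_calls=AUTOTUNE)
           .batch(256)
           .prefetch(AUTOTUNE))

# parallel_model.fit(dataset, epochs=20)   # e.g. the multi-GPU model from Example 1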
Training models with weights merge on CPU
Example 1
import tensorflow as tf
from keras.applications import Xception
from keras.utils import multi_gpu_model
import numpy as np
num_samples = 1000
height = 224
width = 224
num_classes = 1000
# Instantiate the base model (or "template" model).
# We recommend doing this under a CPU device scope,
# so that the model's weights are hosted on CPU memory.
# Otherwise they may end up hosted on a GPU, which would
# complicate weight sharing.
with tf.device('/cpu:0'):
    model = Xception(weights=None,
                     input_shape=(height, width, 3),
                     classes=num_classes)
Training models with weights merge on CPU
Example 1
# Replicates the model on 4 GPUs.
# This assumes that your machine has 4 available GPUs.
parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')
# Generate dummy data.
x = np.random.random((num_samples, height, width, 3))
y = np.random.random((num_samples, num_classes))
# This `fit` call will be distributed across the 4 GPUs.
# Since the batch size is 256, each GPU will process 64 samples.
parallel_model.fit(x, y, epochs=20, batch_size=256)
# Save model via the template model (which shares the same weights):
model.save('my_model.h5')
NOTE:
To save the multi-gpu model, use .save(fname) or .save_weights(fname) with the template model (the argument you passed to multi_gpu_model), rather than the model returned by multi_gpu_model.
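To resume multi-GPU training from the file saved above, a minimal sketch (assuming the same 4-GPU host) is to reload the template model and wrap it again:

from keras.models import load_model
from keras.utils import multi_gpu_model

# Reload the single-GPU template model saved by Example 1, then re-wrap it
# for multi-GPU training; it can also be used directly for inference.
model = load_model('my_model.h5')
parallel_model = multi_gpu_model(model, gpus=4)
parallel_model.compile(loss='categorical_crossentropy', optimizer='rmsprop')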
Training models with weights merge on CPU using cpu_relocation
Example 2
..
# Not needed to change the device scope for model definition:
model = Xception(weights=None, ..)
try:
    parallel_model = multi_gpu_model(model, cpu_relocation=True)
    print("Training using multiple GPUs..")
except ValueError:
    parallel_model = model
    print("Training using single GPU or CPU..")
parallel_model.compile(..)
..
Training models with weights merge on GPU (recommended for AC922)
Example 3
..
# Not needed to change the device scope for model definition:
model = Xception(weights=None, ..)
try:
    parallel_model = multi_gpu_model(model, cpu_merge=False)
    print("Training using multiple GPUs..")
except ValueError:
    parallel_model = model
    print("Training using single GPU or CPU..")
parallel_model.compile(..)
..
Next-Generation
Software Stack
AI Infrastructure Stack
ON-CLOUD and ON-PREM
 Micro-Services / Applications: segment specific (finance, retail, healthcare, automotive)
 APIs (external and in-house): speech, vision, NLP, sentiment
 Machine & Deep Learning Libraries & Frameworks: TensorFlow, Caffe, PyTorch, SparkML, Snap.ML
 Distributed Computing: Spark, MPI
 Data Lake & Data Stores: Hadoop HDFS, NoSQL DBs, parallel file systems
 Transform & Prep Data (ETL)
 Governance AI (fairness, explainable AI, model health, accuracy)
 Accelerated Infrastructure
Watson ML Community Edition (WMLCE)
Version 1.7.0, delivered via bare metal or containers, built on CUDA with cuDNN, NCCL and TensorRT.
 Frameworks: TensorFlow (Estimator, Probability, Serving, TensorBoard, Bazel), Caffe2, PyTorch (APEX), ONNX, Horovod
 Machine learning and data science: RAPIDS.AI (cuDF, cuML), DASK, SnapML (local, MPI, Spark), XGBoost
 Distribution and large models: DDL, Large Model Support (LMSv2), Spectrum MPI
 Trust toolkits: AIX360, AIF360
 Supporting libraries: libevent, libgdf, libgdf_cffi, libopencv, libprotobuf, parquet-cpp, thrift-cpp, arrow-cpp, pyarrow, gflags, magma, cupy, py-opencv, etc.
Watson ML Community Edition (WMLCE)
Version 1.7.0, the same stack split by use, delivered via bare metal or containers:
 Training: TensorFlow (Estimator, Probability, Serving, TensorBoard, Bazel), Caffe2, PyTorch (APEX), Horovod, ONNX, RAPIDS.AI (cuDF, cuML), DASK, SnapML (local, MPI, Spark), XGBoost, DDL, Large Model Support (LMSv2), Spectrum MPI, AIX360, AIF360
 Inference: TensorFlow Serving Server, TensorRT, ONNX, Protobuf
 Common GPU libraries: CUDA, cuDNN, NCCL
 Supporting libraries: libevent, libgdf, libgdf_cffi, libopencv, libprotobuf, parquet-cpp, thrift-cpp, arrow-cpp, pyarrow, gflags, magma, cupy, py-opencv, etc.
AI Explainability and Fairness toolkits on POWER
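As a flavour of what the fairness toolkit enables, here is a small sketch of a dataset-level check with AIF360; the DataFrame, column names and group encodings are illustrative assumptions:

import pandas as pd
from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric

# Toy data: 'sex' is the protected attribute, label 1 is the favorable outcome.
df = pd.DataFrame({'sex':   [0, 0, 1, 1, 1, 0, 1, 0],
                   'score': [0.2, 0.7, 0.6, 0.9, 0.4, 0.8, 0.5, 0.3],
                   'label': [0, 1, 1, 1, 0, 1, 1, 0]})

dataset = BinaryLabelDataset(df=df, label_names=['label'],
                             protected_attribute_names=['sex'],
                             favorable_label=1, unfavorable_label=0)

metric = BinaryLabelDatasetMetric(dataset,
                                  unprivileged_groups=[{'sex': 0}],
                                  privileged_groups=[{'sex': 1}])
print('Disparate impact:', metric.disparate_impact())
print('Statistical parity difference:', metric.statistical_parity_difference())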
IBM Watson Machine Learning Community Edition Docker Containers
WMLCE - Installation

After you have installed Anaconda in your user profile, add the IBM WMLCE conda channel:
$ conda config --prepend channels https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/

Create a Python virtual environment:
$ conda create --name wmlce-1.7.0 python=3.7
$ conda activate wmlce-1.7.0

Install WMLCE:
$ conda install powerai
$ conda install powerai-rapids

Optional packages:
$ conda install py-xgboost-gpu

https://www.ibm.com/support/knowledgecenter/SS5SF7_1.7.0/navigation/wmlce_install.html#wmlce_install__install
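A quick sanity check inside the activated environment, a sketch that only assumes the frameworks installed by the powerai meta-package:

import tensorflow as tf
import torch

print("TensorFlow:", tf.__version__, "GPU visible:", tf.test.is_gpu_available())
print("PyTorch:", torch.__version__, "GPU visible:", torch.cuda.is_available())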
IBM Large Model
Support (LMS)
IBM Large Model Support (LMS)
Swap-out unused parameters to large CPU memory (TB order)
[Figure: layer-by-layer forward and backward passes through layers 1 … l-1, l, l+1 … L and the loss function. In normal backpropagation all parameters stay in GPU memory; with LMS (swap), parameters of layers not currently in use are swapped out to CPU memory during the forward pass and swapped back into GPU memory just before the backward pass needs them.]
Background
 Neural networks are growing deeper and wider
 In the near future, the memory needed to hold the network parameters may exceed GPU memory (16GB, 40GB, etc.)
 Large Model Support is required in deep learning frameworks
 CPU-GPU NVLink plays the key role
IBM Large Model Support (LMS)
LMS seamlessly moves layers of a model between GPU and CPU memory to overcome GPU memory limits, allowing training of:
 Deeper models
 Higher resolution data
 Larger batch sizes
Available for TensorFlow, PyTorch, and Keras.
TFLMSv2 introduces four hyper-parameters to work
with:
 swapout_threshold: The number of tensors to hold within
GPU memory before pushing them to system memory.
 swapin_ahead: The larger swapin_ahead is, the earlier a
tensor is swapped in to the GPU memory from the host
memory.
 swapin_groupby: Multiple swap-in operations of the
same tensor will be grouped or fused into one swap-in
operation for better performance if they are close to each
other (the distance between them is within
swapin_groupby).
 sync_mode: Whether to do synchronisation between
data transfer and kernel computation or not.
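A minimal sketch of what enabling TFLMSv2 from Keras can look like, assuming the WMLCE packaging that exposes LMS as a Keras callback on a TensorFlow 1.x style backend; the module name, hyper-parameter values, model and data are assumptions, not taken from the slides:

import numpy as np
import tensorflow as tf
from tensorflow_large_model_support import LMS   # assumed WMLCE module name

# Illustrative high-resolution workload that would not normally fit in GPU memory.
model = tf.keras.applications.Xception(weights=None,
                                       input_shape=(1024, 1024, 3), classes=10)
x = np.random.random((4, 1024, 1024, 3)).astype('float32')
y = np.random.random((4, 10)).astype('float32')

# The four LMS hyper-parameters described above; values here are illustrative.
lms_callback = LMS(swapout_threshold=1,   # keep few tensors resident, swap aggressively
                   swapin_ahead=1,        # swap tensors back just before they are needed
                   swapin_groupby=0,      # do not fuse neighbouring swap-in operations
                   sync_mode=0)           # no extra synchronisation

model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
model.fit(x, y, batch_size=1, epochs=1, callbacks=[lms_callback])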
What’s possible with Large Model Support
 8.3x image resolution - Keras ResNet50
 14.4x image resolution – ResNet152v2
 7x MRI resolution - 3D U-Net 3D image segmentation
Distributed
Deep Learning
Distributed Deep Learning
Goals
The overall goal of distributed deep learning is to reduce the training time.
To this end, the primary features are:
 Automatic topology detection
 Rankfile generation
 Automatic mpirun option handling
 Efficiency in scalability
Distributed Deep Learning
How does it work?
 A process is created for each GPU in the cluster
 Each process contains a copy of the model
 The mini-batch is spread across all of the processes
 Each process uses different input data
 After each iteration, all of the processes sync and average together their gradients (see the sketch below)
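A toy illustration of the synchronisation step, with made-up gradients standing in for what each GPU computes:

import numpy as np

local_grads = [np.random.randn(10) for _ in range(4)]   # one gradient per GPU/process
averaged = sum(local_grads) / len(local_grads)          # what AllReduce produces on every replica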
Tools and Libraries
The following libraries provide the communication functions necessary to perform distributed training, primarily allReduce and broadcast:
 IBM Spectrum MPI: Classic tool for distributed computing. Still commonly used for distributed deep learning.
 NVIDIA NCCL: NVIDIA's GPU-to-GPU communication library. Since NCCL2, between-node communication is supported.
 IBM DDL: Provides a topology-aware allReduce. Capable of optimally dividing communication across hierarchies of fabrics. Utilizes different communication protocols at different hierarchies. When WMLCE is installed, all related frameworks come with IBM DDL support; you don't have to compile additional software packages, only modify your training scripts to use the distributed deep learning APIs.
Integrations into deep learning frameworks that enable distributed training using these communication libraries include:
 TensorFlow Distribution Strategies: Native TensorFlow distribution methods.
 IBM DDL: Provides integrations into common frameworks, including a TensorFlow operator that integrates IBM DDL with TensorFlow, and similar for PyTorch.
 Horovod [Sergeev et al. 2018]: Provides integration libraries for common frameworks which enable distributed training with common communication libraries; IBM DDL or NCCL can be used as the backend for the Horovod implementation.
Horovod
distributed training framework
 Distributed training framework for:
• TensorFlow
• Keras
• PyTorch
 Separates infrastructure from ML
 Easy installation on top of ML frameworks:
$ conda install horovod
 Best performance with NCCL or DDL: uses bandwidth-optimal communication protocols (NVLink, RDMA (InfiniBand, RoCE)) if available
 Named after a traditional Russian folk dance in which participants dance in a circle with linked hands
Horovod with DDL
Running
$ ddlrun -H host1,host2,host3,host4 -mpiarg "-x HOROVOD_FUSION_THRESHOLD=16777216" python hpms/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet50 --batch_size 64 --variable_update=horovod
I 20:42:52.209 12173 12173 DDL:29 ] [MPI:0 ] ==== IBM Corp. DDL 1.1.0 + (MPI 3.1) ====
...
----------------------------------------------------------------
total images/sec: 5682.34
----------------------------------------------------------------
Horovod Architecture
• Multiple towers (here 2)
• Each tower:
 Runs in the context of an individual OS process (own PID)
 Has its own data pipeline to read and augment data
 Runs its own training step
 Synchronization step via hvd.DistributedOptimizer()
[Figure: two towers (individual processes, Rank 0 and Rank 1) meet in hvd.DistributedOptimizer(), the point of gradient synchronization and of the first NCCL log output.]
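The architecture above maps onto only a few lines of framework code. A minimal Horovod + Keras skeleton, assuming a TensorFlow 1.x style backend as in the earlier Keras examples; the model, data and learning rate are illustrative:

import numpy as np
import keras
import tensorflow as tf
import horovod.keras as hvd

hvd.init()                                              # one process (tower) per GPU

# Pin this process to the GPU matching its local rank.
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())
keras.backend.set_session(tf.Session(config=config))

# Illustrative model and data.
x = np.random.random((1000, 784)).astype('float32')
y = keras.utils.to_categorical(np.random.randint(10, size=(1000, 1)), 10)
model = keras.models.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    keras.layers.Dense(10, activation='softmax'),
])

# Scale the learning rate by the number of workers and wrap the optimizer so
# gradients are averaged with AllReduce at every step.
opt = hvd.DistributedOptimizer(keras.optimizers.SGD(lr=0.01 * hvd.size()))
model.compile(loss='categorical_crossentropy', optimizer=opt)

# Broadcast initial weights from rank 0; write checkpoints only on rank 0.
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]
if hvd.rank() == 0:
    callbacks.append(keras.callbacks.ModelCheckpoint('checkpoint-{epoch}.h5'))

model.fit(x, y, batch_size=64, epochs=5, callbacks=callbacks)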
How to Start Horovod Jobs
In the samples below, train.py is our training code:
• 2 GPUs: mpirun -np 2 --allow-run-as-root -H localhost:2 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python train.py
• 4 GPUs: mpirun -np 4 --allow-run-as-root -H localhost:4 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python train.py
• Single GPU: mpirun -np 1 --allow-run-as-root -H localhost:1 -bind-to none -map-by slot -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH -mca pml ob1 -mca btl ^openib python train.py
A better way is to use horovodrun or ddlrun:
horovodrun -np 16 -H compute1:4,compute2:4,compute3:4,compute4:4 python train.py
One training codebase, one way to start (MPI based), easy to orchestrate, no parameter server
Horovod Rank Enumeration
 Parameters given by Horovod
• hvd.size() – total number of GPUs (workers) in this job
• hvd.rank() – rank id assigned to this specific tower/worker
o Perform special steps in single rank (mostly rank 0)
o Checkpointing
o TensorBoard log writing
 Pitfall: hvd.local_rank() is not unique when doing multi-node jobs!
Horovod MPI Rank Enumeration
A sample with 4 GPUs on 2 nodes: hvd.size() = 4
[Figure: a caller launches four towers (individual processes), two per node.]
 Node 0: hvd.rank() = 0 with hvd.local_rank() = 0, and hvd.rank() = 1 with hvd.local_rank() = 1
 Node 1: hvd.rank() = 2 with hvd.local_rank() = 0, and hvd.rank() = 3 with hvd.local_rank() = 1
WMLCE
and SLURM
integration
SLURM template example for 4x IBM AC922s
Batch AI
SLURM template example for 4x IBM AC922s
Batch AI with Horovod and PyTorch
MNIST example for 4x IBM AC922s
Batch AI with Horovod and PyTorch
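A condensed sketch of such an MNIST training script with Horovod and PyTorch, in the spirit of the standard Horovod PyTorch MNIST example; the network, hyper-parameters and paths are illustrative assumptions. Launch it across the four AC922s with ddlrun or horovodrun as shown earlier:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torchvision import datasets, transforms
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())          # one GPU per process

transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.1307,), (0.3081,))])
train_dataset = datasets.MNIST('data-%d' % hvd.rank(), train=True,
                               download=True, transform=transform)

# Shard the dataset so every rank sees a different part of each epoch.
train_sampler = torch.utils.data.distributed.DistributedSampler(
    train_dataset, num_replicas=hvd.size(), rank=hvd.rank())
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64,
                                           sampler=train_sampler)

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = F.relu(self.fc1(x.view(-1, 784)))
        return F.log_softmax(self.fc2(x), dim=1)

model = Net().cuda()

# Scale the learning rate by the worker count, average gradients with AllReduce,
# and broadcast the initial state so all ranks start identically.
optimizer = optim.SGD(model.parameters(), lr=0.01 * hvd.size(), momentum=0.5)
optimizer = hvd.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters())
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)

for epoch in range(5):
    train_sampler.set_epoch(epoch)
    for data, target in train_loader:
        data, target = data.cuda(), target.cuda()
        optimizer.zero_grad()
        loss = F.nll_loss(model(data), target)
        loss.backward()
        optimizer.step()
    if hvd.rank() == 0:                          # checkpoint from a single rank only
        torch.save(model.state_dict(), 'mnist-epoch%d.pt' % epoch)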
Thank you
Florin Manaila
—
florin.manaila@de.ibm.com
ibm.com
Editor's Notes

  • #10 Training models on a GPU using Keras & TensorFlow is seamless. If you have an NVIDIA card and you have installed CUDA, the libraries will automatically detect it and use it for training. But what if you have multiple GPUs? Unfortunately you will have to do a bit of work to achieve multi-GPU training.
  • #11 There are multiple ways to parallelise a network depending on what you want to achieve, but the two main approaches are model and data parallelization. The first can help if your model is too complex to fit on a single GPU, while the latter helps when you want to speed up execution. The main idea is that you pass your model through the method and it is copied across the different GPUs. The original input is split into chunks which are fed to the various GPUs, and the results are then aggregated into a single output. This method can be used for parallel training and prediction; nevertheless, keep in mind that for training it does not scale linearly with the number of GPUs because of the required synchronization.
  • #12 In synchronized data-parallel distributed deep learning, the major computation steps are: 1. Compute the gradient of the loss function using a minibatch on each GPU. 2. Compute the mean of the gradients by inter-GPU communication. 3. Update the model.