Distributed Deep Learning Optimizations
GEETA CHAUHAN
Oct 14th – 15th, 2017
Agenda
 Distributed DL Challenges
 @ Scale DL Infrastructure
 Parallelize your models
 Techniques for Optimization
 A look into the future
 References
Rise of Deep Learning
Major Advances in AI
• Computer Vision, Language Translation, Speech Recognition, Question & Answer, …
Challenging to build & deploy for large-scale applications
• Latency, Cost, Power consumption issues
• Complexity & size outpacing commodity “General purpose compute”
• Hyper-parameter tuning
Exascale, 15 Watts
Shift towards Specialized Compute
 Special purpose Cloud
 Google TPU, Microsoft Brainwave, IBM PowerAI, Nvidia V100, Intel Nervana
 Spectrum: CPU, GPU, FPGA, custom ASICs
 Edge Compute: hardware accelerators, AI SoC
 Intel Neural Compute Stick, Nvidia Jetson, Nvidia Drive PX (self-driving cars)
 Architectures
 Cluster Compute, HPC, Neuromorphic, Quantum compute
 Complexity in Software
 Model tuning/optimizations specific to hardware
 Growing need for compilers to optimize based on deployment hardware
 Workload specific compute: Model training, Inference
CPU Optimizations
 Leverage high-performance compute tools
 Intel Python, Intel Math Kernel Library (MKL), MKL-DNN
 Compile TensorFlow from source for CPU optimizations
 Proper batch size, using all cores & memory (see the config sketch after this slide)
 Proper Data Format
 NCHW for CPUs vs Tensorflow default NHWC
 Use Queues for Reading Data
Source: Intel Research Blog
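A minimal TensorFlow 1.x sketch of the session settings behind these bullets; the thread counts and data format below are assumptions to tune for your machine, not values from the slide:

import tensorflow as tf

NUM_CORES = 16  # assumption: set to the number of physical cores on your CPU

# Use all cores for the math inside each op, and a few threads for independent ops
config = tf.ConfigProto(
    intra_op_parallelism_threads=NUM_CORES,
    inter_op_parallelism_threads=2)

# Prefer the channels-first (NCHW) layout on CPUs, e.g. in conv layers:
# tf.layers.conv2d(x, filters=64, kernel_size=3, data_format='channels_first')

with tf.Session(config=config) as sess:
    pass  # build and run your graph here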
Tensorflow CPU Optimizations
 Compile from source
 git clone https://github.com/tensorflow/tensorflow.git
 Run ./configure from Tensorflow source directory
 Select option MKL (CPU) Optimization
 Build pip package for install
 bazel build --config=mkl --copt=-DEIGEN_USE_VML -c opt //tensorflow/tools/pip_package:build_pip_package
 Install the optimized TensorFlow wheel
 bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/path_to_save_wheel
 pip install --upgrade --user ~/path_to_save_wheel/wheel_name.whl
 Intel Optimized Pip Wheel files
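As a follow-up, the Intel article listed in the Resources slide recommends a few environment variables for the MKL build; a hedged sketch (the values are common starting points, not universal settings):

import os

# Assumptions: 16 physical cores; tune per the Intel TensorFlow optimization article
os.environ["KMP_BLOCKTIME"] = "0"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
os.environ["OMP_NUM_THREADS"] = "16"

import tensorflow as tf
print(tf.__version__)  # sanity check that the freshly installed wheel is the one imported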
Parallelize your models
 Data Parallelism
 Tensorflow Estimator + Experiments
 Parameter Server, Worker cluster
 Intel BigDL Spark Cluster
 Baidu’s Ring AllReduce
 Uber’s Horovod with Tensor Fusion (see the sketch after this slide)
 HyperTune Google Cloud ML
 Model Parallelism
 Graph too large to fit on one machine
 Tensorflow Model Towers
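To illustrate the data-parallel options above, a minimal Horovod-style sketch in TensorFlow 1.x; the tiny regression model, learning rate, and step count are placeholders, not from the talk:

import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per GPU

# Pin each process to its own GPU
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Placeholder model standing in for a real network
x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
pred = tf.layers.dense(x, 1)
loss = tf.losses.mean_squared_error(y, pred)

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across workers with ring allreduce
opt = hvd.DistributedOptimizer(tf.train.GradientDescentOptimizer(0.01 * hvd.size()))
global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

# Broadcast initial variables from rank 0 so every worker starts identically
hooks = [hvd.BroadcastGlobalVariablesHook(0),
         tf.train.StopAtStepHook(last_step=100)]

with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    while not sess.should_stop():
        bx = np.random.rand(32, 10).astype(np.float32)
        by = np.random.rand(32, 1).astype(np.float32)
        sess.run(train_op, feed_dict={x: bx, y: by})

Launched with MPI, the same single-GPU code replicates across all workers.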
Optimizations for Training
Source: Amazon MXNet
Workload Partitioning
Source: Amazon MXNet
 Minimize communication time
 Place neighboring layers on the same GPU (see the sketch after this slide)
 Balance workload between GPUs
 Different layers have different memory-compute properties
 The model on the left of the source figure is more balanced
 LSTM unrolling: ↓ memory, ↑ compute time
 Encode/Decode: ↑ memory
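A minimal TensorFlow sketch of the partitioning idea (the slide's figure is from MXNet; this only shows manual device placement, and the layer sizes are illustrative assumptions):

import tensorflow as tf

with tf.device('/gpu:0'):
    x = tf.placeholder(tf.float32, [None, 1024])
    h = tf.layers.dense(x, 4096, activation=tf.nn.relu)   # early layers on GPU 0

with tf.device('/gpu:1'):
    h = tf.layers.dense(h, 4096, activation=tf.nn.relu)   # later layers on GPU 1
    logits = tf.layers.dense(h, 10)

# Soft placement falls back to CPU when a listed GPU is not available;
# log_device_placement prints where each op actually runs
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())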
Optimizations for Inferencing
 Graph Transform Tool
 Freeze graph (variables to constants)
 Quantization (32-bit float → 8-bit integer)
 Quantize weights (20 M weights for IV3)
 Memory Mapping
 Inception v3 93 MB → 1.5 MB
bazel build tensorflow/tools/graph_transforms:transform_graph
bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
  --in_graph=/tmp/classify_image_graph_def.pb \
  --outputs="softmax" --out_graph=/tmp/quantized_graph.pb \
  --transforms='add_default_attributes strip_unused_nodes(type=float, shape="1,299,299,3")
  remove_nodes(op=Identity, op=CheckNumerics)
  fold_constants(ignore_errors=true)
  fold_batch_norms fold_old_batch_norms quantize_weights quantize_nodes
  strip_unused_nodes sort_by_execution_order'
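A hedged Python sketch of loading and running the transformed graph above; the input tensor name follows the classify_image Inception v3 example and is an assumption:

import numpy as np
import tensorflow as tf

with tf.gfile.GFile("/tmp/quantized_graph.pb", "rb") as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name="")

with tf.Session(graph=graph) as sess:
    softmax = graph.get_tensor_by_name("softmax:0")
    # "Mul:0" is assumed to be the decoded-image input of this graph;
    # the shape matches the strip_unused_nodes(shape="1,299,299,3") setting
    image = np.zeros((1, 299, 299, 3), dtype=np.float32)
    predictions = sess.run(softmax, feed_dict={"Mul:0": image})
    print(predictions.shape)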
Cluster Optimizations
 Define your ML container locally
 Evaluate with different parameters in the cloud
 Use EFS / GFS for data storage and sharing across nodes
 Create a separate data-processing container
 Mount the EFS/GFS drive on all pods for shared storage
 Avoid GPU fragmentation problems by bundling jobs
 Placement optimizations – Kubernetes: bundle as pods; Mesos: placement constraints
 Bundling GPU drivers in the container is a problem
 Mount as a read-only volume, or use nvidia-docker
Uber’s Horovod on Mesos
 Peloton Gang Scheduler
 MPI-based, bandwidth-optimized communication
 Write code for one GPU; it replicates across the cluster
 Nested Containers
Source: Uber Mesoscon
Future: FPGA Hardware Microservices
Source: Microsoft Research Blog
BrainWave Compiler & Runtime
Source: Microsoft Research Blog
Future: Neuromorphic Compute
Intel’s Loihi: Brain-Inspired AI Chip
Neuromorphic memristors
Future: Quantum Computers
Source: opentranscripts.org
E.g., personalized medicine for diseases like cancer
Resources
 A Study of Complex Deep Learning Networks on High Performance, Neuromorphic, and Quantum Computers: https://arxiv.org/pdf/1703.05364.pdf
 TensorFlow Intel CPU Optimized: https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture
 Microsoft’s Project Brainwave: https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/
 Intel Spark BigDL: https://software.intel.com/en-us/articles/bigdl-distributed-deep-learning-on-apache-spark
 Baidu’s PaddlePaddle on Kubernetes: http://blog.kubernetes.io/2017/02/run-deep-learning-with-paddlepaddle-on-kubernetes.html
 Uber’s Horovod Distributed Training framework for Tensorflow: https://github.com/uber/horovod
 Kubernetes GPU Guide: https://github.com/Langhalsdino/Kubernetes-GPU-Guide
 Tensorflow Quantization: https://www.tensorflow.org/performance/quantization
 Training Deep Nets with Sublinear memory cost: https://arxiv.org/abs/1604.06174
Questions?
Contact
http://bit.ly/geeta4c
geeta@svsg.co
@geeta4c
