Distributed Deep Learning Optimizations
GEETA CHAUHAN
Oct 14th – 15th, 2017
Agenda
 Distributed DL Challenges
 @ Scale DL Infrastructure
 Parallelize your models
 Techniques for Optimization
 A look into the future
 References
Rise of Deep Learning
Major Advances in AI
• Computer Vision, Language Translation, Speech Recognition, Question & Answer, …
Challenging to build & deploy for large-scale applications
• Latency, Cost, Power consumption issues
• Complexity & size outpacing commodity “General purpose compute”
• Hyper-parameter tuning
Exascale, 15 Watts
Shift towards Specialized Compute
 Special purpose Cloud
 Google TPU, Microsoft Brainwave, IBM PowerAI, Nvidia V100, Intel Nervana
 Spectrum: CPU, GPU, FPGA, custom ASICs
 Edge Compute: hardware accelerators, AI SoC
 Intel Neural Compute Stick, Nvidia Jetson, Nvidia Drive PX (self-driving cars)
 Architectures
 Cluster Compute, HPC, Neuromorphic, Quantum compute
 Complexity in Software
 Model tuning/optimizations specific to hardware
 Growing need for compilers to optimize based on deployment hardware
 Workload specific compute: Model training, Inference
CPU Optimizations
 Leverage high-performance compute tools
 Intel Python, Intel Math Kernel Library (MKL), MKL-DNN
 Compile TensorFlow from source for CPU optimizations
 Proper batch size, using all cores & memory (see the config sketch after this slide)
 Proper Data Format
 NCHW for CPUs vs Tensorflow default NHWC
 Use Queues for Reading Data
Source: Intel Research Blog
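A minimal TensorFlow 1.x sketch of the session settings behind these bullets; the thread counts and data format below are assumptions to tune for your machine, not values from the slide:

import tensorflow as tf

NUM_CORES = 16  # assumption: set to the number of physical cores on your CPU

# Use all cores for the math inside each op, and a few threads for independent ops
config = tf.ConfigProto(
    intra_op_parallelism_threads=NUM_CORES,
    inter_op_parallelism_threads=2)

# Prefer the channels-first (NCHW) layout on CPUs, e.g. in conv layers:
# tf.layers.conv2d(x, filters=64, kernel_size=3, data_format='channels_first')

with tf.Session(config=config) as sess:
    pass  # build and run your graph here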
Tensorflow CPU Optimizations
 Compile from source
 git clone https://github.com/tensorflow/tensorflow.git
 Run ./configure from Tensorflow source directory
 Select option MKL (CPU) Optimization
 Build pip package for install
 bazel build --config=mkl --copt=-DEIGEN_USE_VML -c opt //tensorflow/tools/pip_package:build_pip_package
 Install the optimized TensorFlow wheel
 bazel-bin/tensorflow/tools/pip_package/build_pip_package ~/path_to_save_wheel
 pip install --upgrade --user ~/path_to_save_wheel/wheel_name.whl
 Intel Optimized Pip Wheel files
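As a follow-up, the Intel article listed in the Resources slide recommends a few environment variables for the MKL build; a hedged sketch (the values are common starting points, not universal settings):

import os

# Assumptions: 16 physical cores; tune per the Intel TensorFlow optimization article
os.environ["KMP_BLOCKTIME"] = "0"
os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
os.environ["OMP_NUM_THREADS"] = "16"

import tensorflow as tf
print(tf.__version__)  # sanity check that the freshly installed wheel is the one imported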
Parallelize your models
 Data Parallelism
 Tensorflow Estimator + Experiments
 Parameter Server, Worker cluster
 Intel BigDL Spark Cluster
 Baidu’s Ring AllReduce
 Uber’s Horovod with Tensor Fusion (see the sketch after this slide)
 HyperTune Google Cloud ML
 Model Parallelism
 Graph too large to fit on one machine
 Tensorflow Model Towers
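To illustrate the data-parallel options above, a minimal Horovod-style sketch in TensorFlow 1.x; the tiny regression model, learning rate, and step count are placeholders, not from the talk:

import numpy as np
import tensorflow as tf
import horovod.tensorflow as hvd

hvd.init()  # one process per GPU

# Pin each process to its own GPU
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Placeholder model standing in for a real network
x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
pred = tf.layers.dense(x, 1)
loss = tf.losses.mean_squared_error(y, pred)

# Scale the learning rate by the number of workers and wrap the optimizer
# so gradients are averaged across workers with ring allreduce
opt = hvd.DistributedOptimizer(tf.train.GradientDescentOptimizer(0.01 * hvd.size()))
global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

# Broadcast initial variables from rank 0 so every worker starts identically
hooks = [hvd.BroadcastGlobalVariablesHook(0),
         tf.train.StopAtStepHook(last_step=100)]

with tf.train.MonitoredTrainingSession(config=config, hooks=hooks) as sess:
    while not sess.should_stop():
        bx = np.random.rand(32, 10).astype(np.float32)
        by = np.random.rand(32, 1).astype(np.float32)
        sess.run(train_op, feed_dict={x: bx, y: by})

Launched with MPI, the same single-GPU code replicates across all workers.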
Optimizations for Training
Source: Amazon MXNet
Workload Partitioning
Source: Amazon MXNet
 Minimize communication time
 Place neighboring layers on the same GPU (see the sketch after this slide)
 Balance workload between GPUs
 Different layers have different memory-compute properties
 The model on the left of the source figure is more balanced
 LSTM unrolling: ↓ memory, ↑ compute time
 Encode/Decode: ↑ memory
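A minimal TensorFlow sketch of the partitioning idea (the slide's figure is from MXNet; this only shows manual device placement, and the layer sizes are illustrative assumptions):

import tensorflow as tf

with tf.device('/gpu:0'):
    x = tf.placeholder(tf.float32, [None, 1024])
    h = tf.layers.dense(x, 4096, activation=tf.nn.relu)   # early layers on GPU 0

with tf.device('/gpu:1'):
    h = tf.layers.dense(h, 4096, activation=tf.nn.relu)   # later layers on GPU 1
    logits = tf.layers.dense(h, 10)

# Soft placement falls back to CPU when a listed GPU is not available;
# log_device_placement prints where each op actually runs
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())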
Optimizations for Inferencing
 Graph Transform Tool
 Freeze graph (variables to constants)
 Quantization (32-bit float → 8-bit integer)
 Quantize weights (20 M weights for IV3)
 Memory Mapping
 Inception v3 93 MB → 1.5 MB
bazel build tensorflow/tools/graph_transforms:transform_graph
bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
  --in_graph=/tmp/classify_image_graph_def.pb \
  --outputs="softmax" --out_graph=/tmp/quantized_graph.pb \
  --transforms='add_default_attributes strip_unused_nodes(type=float, shape="1,299,299,3")
  remove_nodes(op=Identity, op=CheckNumerics)
  fold_constants(ignore_errors=true)
  fold_batch_norms fold_old_batch_norms quantize_weights quantize_nodes
  strip_unused_nodes sort_by_execution_order'
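A hedged Python sketch of loading and running the transformed graph above; the input tensor name follows the classify_image Inception v3 example and is an assumption:

import numpy as np
import tensorflow as tf

with tf.gfile.GFile("/tmp/quantized_graph.pb", "rb") as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())

with tf.Graph().as_default() as graph:
    tf.import_graph_def(graph_def, name="")

with tf.Session(graph=graph) as sess:
    softmax = graph.get_tensor_by_name("softmax:0")
    # "Mul:0" is assumed to be the decoded-image input of this graph;
    # the shape matches the strip_unused_nodes(shape="1,299,299,3") setting
    image = np.zeros((1, 299, 299, 3), dtype=np.float32)
    predictions = sess.run(softmax, feed_dict={"Mul:0": image})
    print(predictions.shape)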
Cluster Optimizations
 Define your ML container locally
 Evaluate with different parameters in the cloud
 Use EFS / GFS for data storage and sharing across nodes
 Create a separate data-processing container
 Mount the EFS/GFS drive on all pods for shared storage
 Avoid GPU fragmentation problems by bundling jobs
 Placement optimizations – Kubernetes: bundle as pods; Mesos: placement constraints
 Bundling GPU drivers in the container is a problem
 Mount as a read-only volume, or use nvidia-docker
Uber’s Horovod on Mesos
 Peloton Gang Scheduler
 MPI-based, bandwidth-optimized communication
 Write code for one GPU; it replicates across the cluster
 Nested Containers
Source: Uber Mesoscon
Future: FPGA Hardware Microservices
Source: Microsoft Research Blog
BrainWave Compiler & Runtime
Source: Microsoft Research Blog
Future: Neuromorphic Compute
Intel’s Loihi: Brain-Inspired AI Chip
Neuromorphic memristors
Future: Quantum Computers
Source: opentranscripts.org
E.g., personalized medicine for diseases like cancer
Resources
 A Study of Complex Deep Learning Networks on High Performance, Neuromorphic, and Quantum Computers: https://arxiv.org/pdf/1703.05364.pdf
 TensorFlow Intel CPU Optimized: https://software.intel.com/en-us/articles/tensorflow-optimizations-on-modern-intel-architecture
 Microsoft’s Project Brainwave: https://www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/
 Intel Spark BigDL: https://software.intel.com/en-us/articles/bigdl-distributed-deep-learning-on-apache-spark
 Baidu’s PaddlePaddle on Kubernetes: http://blog.kubernetes.io/2017/02/run-deep-learning-with-paddlepaddle-on-kubernetes.html
 Uber’s Horovod Distributed Training framework for Tensorflow: https://github.com/uber/horovod
 Kubernetes GPU Guide: https://github.com/Langhalsdino/Kubernetes-GPU-Guide
 Tensorflow Quantization: https://www.tensorflow.org/performance/quantization
 Training Deep Nets with Sublinear memory cost: https://arxiv.org/abs/1604.06174
Questions?
Contact
http://bit.ly/geeta4c
geeta@svsg.co
@geeta4c
