Wen Phan
April 20, 2017
Introduction to GPUs for Machine Learning
Agenda
• Context and Why GPUs?
– Matrix Multiplication Example
• CUDA
• GPU and Machine Learning
– Deep Learning
– Parallel Computing: GBM, GLM
• Getting Started
• Others
Need for More Compute
• Lots of Data
• Complex Architectures
• Many Models
Historical Ways to Get More Compute
• Faster Clock Rates
• Multi-Core
• Distributed Computing
CPU Trends
Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten,
dotted line extrapolations by C. Moore
Distributed Computing
Why GPUs?
GPU-Accelerated Computing
• CPU: optimized for serial tasks
• GPU accelerator: optimized for many parallel tasks
• GPUs for parallel tasks: traditional CPUs alone are not economically feasible (2.3 PFLOPS, 7.0 megawatts, roughly the power of 7,000 homes)
• 10x performance per socket
• > 5x energy efficiency
• The era of GPU-accelerated computing is here
NVIDIA
GPU Devotes More Transistors to Data Processing
CUDA C Programming Guide
CPU vs. GPU
https://videocardz.com/39721/nvidia-geforce-gtx-titan-released
Latency Versus Throughput
• Latency: Time to do a task.
• Throughput: Number of tasks per unit time.
• Fictitious Example:
– CPU
• Latency: 1 ns per task
• Throughput: (1 task per ns) x (6 cores) = 6 tasks per ns
– GPU
• Latency: 10 ns per task
• Throughput: (0.1 tasks per ns) x (2000 cores) = 200 tasks per ns
• CPUs are latency optimized; GPUs are throughput optimized
NVIDIA's Latest GPUs
http://www.anandtech.com/show/11172/nvidia-unveils-geforce-gtx-1080-ti-next-week-699
CUDA C Programming Guide
Matrix Multiplication
Matrix Multiplication
A \in \mathbb{R}^{m \times k}, \quad B \in \mathbb{R}^{k \times n}, \quad C = AB \in \mathbb{R}^{m \times n}
Matrix Multiplication
A = \begin{bmatrix}
a_{1,1} & a_{1,2} & a_{1,3} & \dots & a_{1,k} \\
a_{2,1} & a_{2,2} & a_{2,3} & \dots & a_{2,k} \\
a_{3,1} & a_{3,2} & a_{3,3} & \dots & a_{3,k} \\
\vdots  & \vdots  & \vdots  & \ddots & \vdots \\
a_{m,1} & a_{m,2} & a_{m,3} & \dots & a_{m,k}
\end{bmatrix},
\quad
B = \begin{bmatrix}
b_{1,1} & b_{1,2} & b_{1,3} & \dots & b_{1,n} \\
b_{2,1} & b_{2,2} & b_{2,3} & \dots & b_{2,n} \\
b_{3,1} & b_{3,2} & b_{3,3} & \dots & b_{3,n} \\
\vdots  & \vdots  & \vdots  & \ddots & \vdots \\
b_{k,1} & b_{k,2} & b_{k,3} & \dots & b_{k,n}
\end{bmatrix}

C = AB = \begin{bmatrix}
c_{1,1} & c_{1,2} & c_{1,3} & \dots & c_{1,n} \\
c_{2,1} & c_{2,2} & c_{2,3} & \dots & c_{2,n} \\
c_{3,1} & c_{3,2} & c_{3,3} & \dots & c_{3,n} \\
\vdots  & \vdots  & \vdots  & \ddots & \vdots \\
c_{m,1} & c_{m,2} & c_{m,3} & \dots & c_{m,n}
\end{bmatrix},
\qquad
c_{i,j} = \sum_{h=1}^{k} a_{i,h} \, b_{h,j}
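As a point of reference, here is a minimal serial C implementation of this formula (an illustrative sketch, not from the slides). Every entry of C needs a loop over k products, and all m x n entries are independent of one another, which is exactly the structure a GPU can exploit.

    /* Serial matrix multiplication: C (m x n) = A (m x k) * B (k x n),
       all matrices stored in row-major order. */
    void matmul_cpu(const float *A, const float *B, float *C,
                    int m, int k, int n) {
        for (int i = 0; i < m; ++i) {
            for (int j = 0; j < n; ++j) {
                float c = 0.0f;
                for (int h = 0; h < k; ++h)
                    c += A[i * k + h] * B[h * n + j];   /* c_ij = sum_h a_ih * b_hj */
                C[i * n + j] = c;
            }
        }
    }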
CUDA
CUDA
• Historically, GPUs were used for, well, graphics processing. But people realized that the fine-grained parallelism inherent in the GPU architecture could be exploited for general-purpose computing.
• CUDA (Compute Unified Device Architecture)
– Parallel computing platform
– Programming model and API
– Allows CUDA-enabled GPUs to be used for general-purpose processing (see the minimal kernel sketch below)
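As a minimal illustration of the programming model (an illustrative sketch, not taken from the slides): a kernel is a C function marked __global__ that runs on the GPU, one instance per thread, and the host launches a whole grid of those threads at once.

    // Minimal CUDA kernel: element-wise vector addition, one output element per thread.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                                      // guard against extra threads
            c[i] = a[i] + b[i];
    }

    // Host-side launch (assuming d_a, d_b, d_c already live in GPU memory):
    // vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);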
Speed Up Parallelizable Code
• Application code is split between the two processors
• GPU: parallelize the compute-intensive functions
• CPU: run the rest of the sequential code
NVIDIA
Matrix Multiplication
CUDA C Programming Guide
CUDA Matrix Multiplication
CUDA C Programming Guide
• The CPU and the GPU each have their own memory (CPU memory and GPU memory); data must be copied between the two
CUDA C Programming Guide
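The picture above implies the usual host-side workflow: allocate GPU memory, copy inputs from CPU memory to GPU memory, launch a kernel, and copy the results back. A hedged sketch of that flow with the standard CUDA runtime API (the kernel launch itself is elided):

    #include <cuda_runtime.h>

    // Illustrative round trip: copy input to the GPU, compute, copy the result back.
    void gpu_roundtrip(const float *h_in, float *h_out, size_t n) {
        size_t bytes = n * sizeof(float);
        float *d_in, *d_out;                                       // device (GPU) pointers
        cudaMalloc((void **)&d_in,  bytes);                        // allocate GPU memory
        cudaMalloc((void **)&d_out, bytes);
        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);     // CPU -> GPU
        /* ... launch a kernel that reads d_in and writes d_out ... */
        cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);   // GPU -> CPU
        cudaFree(d_in);                                            // release GPU memory
        cudaFree(d_out);
    }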
Matrix Multiplication Kernel
CUDA C Programming Guide
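The slide shows the kernel from the CUDA C Programming Guide; a sketch in the same spirit (the simple, non-shared-memory version, with matrix dimensions passed explicitly) looks like this:

    // Simple (non-tiled) matrix multiplication kernel: each thread computes one
    // element of C = A * B, with A (m x k), B (k x n), C (m x n) in row-major order.
    __global__ void matMulKernel(const float *A, const float *B, float *C,
                                 int m, int k, int n) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < m && col < n) {
            float value = 0.0f;
            for (int h = 0; h < k; ++h)
                value += A[row * k + h] * B[h * n + col];
            C[row * n + col] = value;   // one c_ij per thread, all computed in parallel
        }
    }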
CUDA Ecosystem
SC12 Demo: Using CUDA Library to accelerate applications
GPUs and Machine
Learning
GPUs and Machine Learning
• Poster Child: Deep Learning
• Parallel Computing
– Model Parallelism
– Data Parallelism
– Training Parallelism
Deep Learning
Multi-Layer Perceptron Neural Network
Efron and Hastie. Computer Age Statistical Inference.
MNIST
MNIST Database
Image as a Tensor
TensorFlow
Training
TensorFlow
Supervised Learning
TensorFlow
Neuron
Activation Functions
Efron and Hastie. Computer Age Statistical Inference.
Layer of Neurons
Layer of Neurons
Matrix Multiplication!
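To make the "matrix multiplication" point concrete, here is an illustrative sketch (names and the choice of activation are assumptions, not from the slides): the weights of a fully connected layer form a matrix W, so computing every neuron's pre-activation at once is the matrix-vector product Wx plus a bias vector, followed by the activation function.

    #include <math.h>

    /* Forward pass of one fully connected layer with `out` neurons and `in` inputs:
       a = g(W x + b).  W is row-major (out x in); tanh stands in for g purely
       for illustration. */
    void dense_forward(const float *W, const float *b, const float *x,
                       float *a, int out, int in) {
        for (int i = 0; i < out; ++i) {
            float z = b[i];
            for (int j = 0; j < in; ++j)
                z += W[i * in + j] * x[j];   /* row i of W dotted with x */
            a[i] = tanhf(z);                 /* activation function g */
        }
    }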
Hidden Layers
Convolutional Neural Networks
• Leverages the fact that data has spatial structure
– Adds the idea of locality
• Tremendous success with computer vision tasks
• “Put deep learning on the map”
Frobenius Inner Product
X = \begin{bmatrix} 2 & 2 & 1 \\ 2 & 0 & 1 \\ 2 & 1 & 2 \end{bmatrix},
\quad
K = \begin{bmatrix} 1 & 1 & 1 \\ -1 & 1 & -1 \\ -1 & 1 & 1 \end{bmatrix}

\langle X, K \rangle_F = \sum_{i,j} x_{i,j} \, k_{i,j}
= (2)(1) + (2)(1) + (1)(1) + (2)(-1) + (0)(1) + (1)(-1) + (2)(-1) + (1)(1) + (2)(1)
= 2 + 2 + 1 - 2 + 0 - 1 - 2 + 1 + 2
= 3
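A small C sketch of the same computation (illustrative only, not from the slides): the Frobenius inner product is an element-wise multiply-and-accumulate over the image patch and the kernel.

    /* Frobenius inner product of two r x c matrices stored row-major:
       <X, K>_F = sum_{i,j} x_ij * k_ij */
    float frobenius_inner(const float *X, const float *K, int r, int c) {
        float acc = 0.0f;
        for (int idx = 0; idx < r * c; ++idx)
            acc += X[idx] * K[idx];
        return acc;
    }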
Convolution Layer
Andrej Karpathy. CS231n Convolutional Neural Networks for Visual Recognition.
Convolutional Layer
• Input volume X of width w, height h, and depth d; filter K_f with weights k_{i,j,d}; bias b; nonlinearity g(·)
• Each output activation is
g\left( \sum_{d} \sum_{i,j} x_{i,j,d} \, k_{i,j,d} + b \right)
Convolutional Layer
• Convolving each filter with the input image X (w x h x d) produces an activation map
Convolutional Layer
• f = receptive field
(filter size)
• p = padding
• s = stride
• m = number of filters
• Convolution maps an input volume (w_I x h_I x d_I) to an output volume (w_O x h_O x d_O); a small helper computing these sizes is sketched below:
w_O = \frac{w_I - f + 2p}{s} + 1
h_O = \frac{h_I - f + 2p}{s} + 1
d_O = m
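A small helper applying these formulas (an illustrative sketch; the integer division assumes f, p, and s are chosen so the filter tiles the padded input evenly):

    /* Output volume of a convolutional layer with receptive field f, padding p,
       stride s, and m filters, given an input volume of wI x hI x dI. */
    void conv_output_shape(int wI, int hI, int f, int p, int s, int m,
                           int *wO, int *hO, int *dO) {
        *wO = (wI - f + 2 * p) / s + 1;
        *hO = (hI - f + 2 * p) / s + 1;
        *dO = m;   /* one activation map per filter */
    }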
Example: LeNet
Inception ResNet V2
ImageNet
cuDNN
ImageNet Results
http://kaiminghe.com/ilsvrc15/ilsvrc2015_deep_residual_learning_kaiminghe.pdf
ImageNet Entries Using GPUs
https://devblogs.nvidia.com/parallelforall/nvidia-ibm-cloud-support-imagenet-large-scale-visual-recognition-challenge/
Deep Water: Next-Gen Distributed Deep Learning
One Interface - GPU Enabled - Significant Performance Gains
Inherits All H2O Properties in Scalability, Ease of Use and Deployment
Recurrent Neural Networks
enabling natural language
processing, sequences, time series,
and more
Convolutional Neural Networks
enabling image, video, and speech recognition
Hybrid Neural Network Architectures
enabling speech to text translation,
image captioning, scene parsing and
more
H2O integrates with existing GPU
backends for significant performance
gains
H2O Deep Learning Algo
Cat, Dog, or Mouse?
Parallel Computing
Parallel Computing
• Model Parallelism: Split up a single model
Random Forest
T_1(x), T_2(x), T_3(x), \dots, T_B(x)
\hat{y} = f(x; T_1, \dots, T_B)
Random Forest
T_1(x), T_2(x), T_3(x), \dots, T_B(x)
Deep Learning Model Parallelism
Large Scale Distributed Deep Networks. J. Dean, et al.
Parallel Computing
• Model Parallelism: Split up a single model
• Data Parallelism: Split up data to train a single model
Deep Learning Data Parallelism
Large Scale Distributed Deep Networks. J. Dean, et al.
H2O Deep Learning Architecture
Gradient Boosting Machine (GBM)
T_1(x), T_2(x), T_3(x), \dots, T_M(x)
f_M(x) = \sum_{i=1}^{M} T_i(x)
Gradient Boosting Machine (GBM)
T_1(x), T_2(x), T_3(x), \dots, T_M(x)
f_i(x) = f_{i-1}(x) + T_i(x; \hat{\Theta}_i)
Gradient Boosting Machine (GBM)
T_1(x), T_2(x), T_3(x), \dots, T_M(x)
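As an illustrative sketch of what these formulas say (not code from the slides): a boosted model's prediction is the running sum of its trees' predictions, each new tree T_i fit to improve on the model built so far.

    /* f_M(x) = sum_{i=1}^{M} T_i(x): evaluate a boosted ensemble by summing the
       predictions of its M trees.  tree_predict[] stands in for evaluating the
       individual fitted regression trees. */
    float gbm_predict(const float *x, int M,
                      float (*tree_predict[])(const float *x)) {
        float f = 0.0f;
        for (int i = 0; i < M; ++i)
            f += tree_predict[i](x);   /* f_i(x) = f_{i-1}(x) + T_i(x) */
        return f;
    }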
Decision Tree
• Recursively partitions the feature space into regions R_1, R_2, ...
Hastie, Tibshirani, Friedman. Elements of Statistical Learning
GBM Data Parallelism
• Training data with features x_1, x_2, x_3, ..., x_p and response y is split row-wise across K nodes: X = {X_1, ..., X_K}, with node k holding partition X_k
• To grow a tree, we need the best split {X_i; t_i} (which feature i and which threshold t_i)
• Each node computes local statistics math(X_k) on its own partition
• The best split is then found from the combined statistics: {X_i; t_i} = f(math(X_1), ..., math(X_K))
Full Data Parallelism for Each Level of Tree Growth!
CPU Cluster
GPU Cluster
• Leverage GPUs to accelerate processing and exploit fine-grained parallelism on each node
• Each node's GPU computes its local statistics math(X_1), math(X_2), math(X_3), ..., math(X_K)
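The slides do not spell out what math(X_k) is; one common choice in distributed GBMs, and an assumption here rather than something the slides confirm, is a per-bin histogram of counts and response (or gradient) sums over the node's local rows. Histograms from all nodes can then be added together to evaluate candidate splits {X_i; t_i}.

    /* Illustrative "math(X_k)": a histogram over pre-binned values of one feature,
       accumulating the row count and the response sum per bin on this node's
       local partition. */
    typedef struct { int count; double sum_y; } Bin;

    void local_histogram(const int *bin_idx,   /* pre-binned feature values     */
                         const double *y,      /* response (or gradient) values */
                         int n_rows, int n_bins,
                         Bin *hist)            /* output: n_bins entries        */
    {
        for (int b = 0; b < n_bins; ++b) { hist[b].count = 0; hist[b].sum_y = 0.0; }
        for (int r = 0; r < n_rows; ++r) {
            hist[bin_idx[r]].count += 1;
            hist[bin_idx[r]].sum_y += y[r];
        }
    }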
GBM on GPU
T_1(x), T_2(x), T_3(x), \dots, T_M(x)
• As before, the application code is split: the GPU parallelizes the compute-intensive functions, and the CPU runs the rest of the sequential code
Parallel Computing
• Model Parallelism: Split up a single model
• Data Parallelism: Split up data to train a single model
• Training Parallelism: Split up different parts of the training process
– Ensemble Base Learners
– Cross-Validation
– Hyperparameters
Linear Regression
Hastie, Tibshirani, Friedman. Elements of Statistical Learning

X = \begin{bmatrix}
x_{1,1} & x_{1,2} & \dots & x_{1,p} \\
x_{2,1} & x_{2,2} & \dots & x_{2,p} \\
\vdots  & \vdots  & \ddots & \vdots \\
x_{n,1} & x_{n,2} & \dots & x_{n,p}
\end{bmatrix}

\underset{\beta}{\text{minimize}} \; \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{i,j} \beta_j \right)^2

\underset{\beta}{\text{minimize}} \; \| y - X\beta \|_2^2
Ridge Regression
Hastie, Tibshirani, Wainwright. Statistical Learning with Sparsity

\underset{\beta_0, \beta}{\text{minimize}} \; \| y - X\beta \|_2^2 \quad \text{subject to} \quad \| \beta \|_2^2 \le t

Ridge Regression
Hastie, Tibshirani, Wainwright. Statistical Learning with Sparsity

\underset{\beta}{\text{minimize}} \; \| y - X\beta \|_2^2 + \lambda \| \beta \|_2^2
Lasso Regression
Hastie, Tibshirani, Wainwright. Statistical Learning with Sparsity

\underset{\beta_0, \beta}{\text{minimize}} \; \| y - X\beta \|_2^2 \quad \text{subject to} \quad \| \beta \|_1 \le t

Lasso Regression
Hastie, Tibshirani, Wainwright. Statistical Learning with Sparsity

\underset{\beta}{\text{minimize}} \; \| y - X\beta \|_2^2 + \lambda \| \beta \|_1
Elastic Net Regression
\underset{\beta}{\text{minimize}} \; \| y - X\beta \|_2^2 + \lambda \left( \alpha \| \beta \|_1 + \frac{1}{2} (1 - \alpha) \| \beta \|_2^2 \right)
Single Node GPU
• One node: a CPU host with a single GPU
Single Node Multi-GPU
• One node: a CPU host with multiple GPUs
Elastic Net Regression Training Parallelism
J(\beta; \lambda, \alpha) = \| y - X\beta \|_2^2 + \lambda \left( \alpha \| \beta \|_1 + \frac{1}{2} (1 - \alpha) \| \beta \|_2^2 \right)

GPU_1: \underset{\beta}{\text{min}} \; J(\beta; \lambda_i, \alpha_1) \qquad GPU_2: \underset{\beta}{\text{min}} \; J(\beta; \lambda_i, \alpha_2) \qquad \dots \qquad GPU_G: \underset{\beta}{\text{min}} \; J(\beta; \lambda_i, \alpha_G)

• Each GPU takes one mixing value \alpha_g \in [0, 1] and sweeps over the grid of regularization values \lambda_i
Elastic Net Regression Training Parallelism
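A hedged sketch of this training parallelism using the CUDA runtime's device-selection call (solve_elastic_net is a hypothetical placeholder, not a real API; the point is the dispatch pattern of one alpha value per GPU, each sweeping the lambda grid):

    #include <cuda_runtime.h>

    /* Hypothetical single-GPU elastic net solver; a real implementation would run
       the solver on the currently selected device.  Stubbed out for illustration. */
    static void solve_elastic_net(const float *X, const float *y,
                                  float lambda, float alpha) {
        (void)X; (void)y; (void)lambda; (void)alpha;   /* placeholder */
    }

    /* Training parallelism: GPU g handles alpha[g] over the whole lambda grid. */
    void sweep_elastic_net(const float *X, const float *y,
                           const float *lambdas, int n_lambdas,
                           const float *alphas, int n_gpus) {
        for (int g = 0; g < n_gpus; ++g) {
            cudaSetDevice(g);                          /* route work to GPU g */
            for (int i = 0; i < n_lambdas; ++i)
                solve_elastic_net(X, y, lambdas[i], alphas[g]);
        }
    }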
Getting Started
CUDA
https://developer.nvidia.com/cuda-downloads
cuDNN
https://developer.nvidia.com/cudnn
Deep Water AMI
https://docs.h2o.ai
Others
Multi-GPU Multi-Node Cluster
Others
• Efficient CPU-GPU Utilization
• Communication Link and Overhead
NVLink
Others
• Efficient CPU-GPU Utilization
• Communication Link and Overhead
• Inference
GTC 2017: ML and AI on GPUs
Train Deep Learning Models Using H2O Deep Water
Questions?
