Wen Phan
April 20, 2017
Introduction to GPUs for Machine Learning
Agenda
• Context and Why GPUs?
– Matrix Multiplication Example
• CUDA
• GPU and Machine Learning
– Deep Learning
– Parallel Computing: GBM, GLM
• Getting Started
• Others
Need for More Compute
• Lots of Data
• Complex Architectures
• Many Models
Historical Ways to Get More Compute
• Faster Clock Rates
• Multi-Core
• Distributed Computing
CPU Trends
Original data collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten,
dotted line extrapolations by C. Moore
Distributed Computing
Why GPUs?
GPU-Accelerated Computing
• CPU: optimized for serial tasks
• GPU accelerator: optimized for many parallel tasks
• GPUs for parallel tasks: traditional CPUs alone are not economically feasible (2.3 PFLOPS, 7.0 megawatts, roughly the power of 7,000 homes)
• 10x performance per socket
• > 5x energy efficiency
• The era of GPU-accelerated computing is here
NVIDIA
GPU Devotes More Transistors to Data Processing
CUDA C Programming Guide
CPU vs. GPU
https://videocardz.com/39721/nvidia-geforce-gtx-titan-released
Latency Versus Throughput
• Latency: Time to do a task.
• Throughput: Number of tasks per unit time.
• Fictitious Example:
– CPU
• Latency: 1 ns per task
• Throughput: (1 task per ns) x (6 cores) = 6 tasks per ns
– GPU
• Latency: 10 ns per task
• Throughput: (0.1 tasks per ns) x (2000 cores) = 200 tasks per ns
• CPUs are latency optimized; GPUs are throughput optimized
NVIDIA's Latest GPUs
http://www.anandtech.com/show/11172/nvidia-unveils-geforce-gtx-1080-ti-next-week-699
CUDA C Programming Guide
Matrix Multiplication
Matrix Multiplication
A \in \mathbb{R}^{m \times k}, \quad B \in \mathbb{R}^{k \times n}, \quad C = AB \in \mathbb{R}^{m \times n}
Matrix Multiplication
A = \begin{bmatrix}
a_{1,1} & a_{1,2} & a_{1,3} & \dots & a_{1,k} \\
a_{2,1} & a_{2,2} & a_{2,3} & \dots & a_{2,k} \\
a_{3,1} & a_{3,2} & a_{3,3} & \dots & a_{3,k} \\
\vdots  & \vdots  & \vdots  & \ddots & \vdots \\
a_{m,1} & a_{m,2} & a_{m,3} & \dots & a_{m,k}
\end{bmatrix},
\quad
B = \begin{bmatrix}
b_{1,1} & b_{1,2} & b_{1,3} & \dots & b_{1,n} \\
b_{2,1} & b_{2,2} & b_{2,3} & \dots & b_{2,n} \\
b_{3,1} & b_{3,2} & b_{3,3} & \dots & b_{3,n} \\
\vdots  & \vdots  & \vdots  & \ddots & \vdots \\
b_{k,1} & b_{k,2} & b_{k,3} & \dots & b_{k,n}
\end{bmatrix}

C = AB = \begin{bmatrix}
c_{1,1} & c_{1,2} & c_{1,3} & \dots & c_{1,n} \\
c_{2,1} & c_{2,2} & c_{2,3} & \dots & c_{2,n} \\
c_{3,1} & c_{3,2} & c_{3,3} & \dots & c_{3,n} \\
\vdots  & \vdots  & \vdots  & \ddots & \vdots \\
c_{m,1} & c_{m,2} & c_{m,3} & \dots & c_{m,n}
\end{bmatrix},
\qquad
c_{i,j} = \sum_{h=1}^{k} a_{i,h} \, b_{h,j}
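As a point of reference, here is a minimal serial C implementation of this formula (an illustrative sketch, not from the slides). Every entry of C needs a loop over k products, and all m x n entries are independent of one another, which is exactly the structure a GPU can exploit.

    /* Serial matrix multiplication: C (m x n) = A (m x k) * B (k x n),
       all matrices stored in row-major order. */
    void matmul_cpu(const float *A, const float *B, float *C,
                    int m, int k, int n) {
        for (int i = 0; i < m; ++i) {
            for (int j = 0; j < n; ++j) {
                float c = 0.0f;
                for (int h = 0; h < k; ++h)
                    c += A[i * k + h] * B[h * n + j];   /* c_ij = sum_h a_ih * b_hj */
                C[i * n + j] = c;
            }
        }
    }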
CUDA
CUDA
• Historically, GPUs were used for, well, graphics processing. But people realized that the fine-grained parallelism inherent in the GPU architecture could be exploited for general-purpose computing.
• CUDA (Compute Unified Device Architecture)
– Parallel computing platform
– Programming model and API
– Allows CUDA-enabled GPUs to be used for general-purpose processing (see the minimal kernel sketch below)
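As a minimal illustration of the programming model (an illustrative sketch, not taken from the slides): a kernel is a C function marked __global__ that runs on the GPU, one instance per thread, and the host launches a whole grid of those threads at once.

    // Minimal CUDA kernel: element-wise vector addition, one output element per thread.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)                                      // guard against extra threads
            c[i] = a[i] + b[i];
    }

    // Host-side launch (assuming d_a, d_b, d_c already live in GPU memory):
    // vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);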
Speed Up Parallelizable Code
• Application code is split between the two processors
• GPU: parallelize the compute-intensive functions
• CPU: run the rest of the sequential code
NVIDIA
Matrix Multiplication
CUDA C Programming Guide
CUDA Matrix Multiplication
CUDA C Programming Guide
• The CPU and the GPU each have their own memory (CPU memory and GPU memory); data must be copied between the two
CUDA C Programming Guide
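The picture above implies the usual host-side workflow: allocate GPU memory, copy inputs from CPU memory to GPU memory, launch a kernel, and copy the results back. A hedged sketch of that flow with the standard CUDA runtime API (the kernel launch itself is elided):

    #include <cuda_runtime.h>

    // Illustrative round trip: copy input to the GPU, compute, copy the result back.
    void gpu_roundtrip(const float *h_in, float *h_out, size_t n) {
        size_t bytes = n * sizeof(float);
        float *d_in, *d_out;                                       // device (GPU) pointers
        cudaMalloc((void **)&d_in,  bytes);                        // allocate GPU memory
        cudaMalloc((void **)&d_out, bytes);
        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);     // CPU -> GPU
        /* ... launch a kernel that reads d_in and writes d_out ... */
        cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);   // GPU -> CPU
        cudaFree(d_in);                                            // release GPU memory
        cudaFree(d_out);
    }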
Matrix Multiplication Kernel
CUDA C Programming Guide
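The slide shows the kernel from the CUDA C Programming Guide; a sketch in the same spirit (the simple, non-shared-memory version, with matrix dimensions passed explicitly) looks like this:

    // Simple (non-tiled) matrix multiplication kernel: each thread computes one
    // element of C = A * B, with A (m x k), B (k x n), C (m x n) in row-major order.
    __global__ void matMulKernel(const float *A, const float *B, float *C,
                                 int m, int k, int n) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;
        int col = blockIdx.x * blockDim.x + threadIdx.x;
        if (row < m && col < n) {
            float value = 0.0f;
            for (int h = 0; h < k; ++h)
                value += A[row * k + h] * B[h * n + col];
            C[row * n + col] = value;   // one c_ij per thread, all computed in parallel
        }
    }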
CUDA Ecosystem
SC12 Demo: Using CUDA Library to accelerate applications
GPUs and Machine
Learning
GPUs and Machine Learning
• Poster Child: Deep Learning
• Parallel Computing
– Model Parallelism
– Data Parallelism
– Training Parallelism
Deep Learning
Multi-Layer Perceptron Neural Network
Efron and Hastie. Computer Age Statistical Inference.
MNIST
MNIST Database
Image as a Tensor
TensorFlow
Training
TensorFlow
Supervised Learning
TensorFlow
Neuron
Activation Functions
Efron and Hastie. Computer Age Statistical Inference.
Layer of Neurons
Layer of Neurons
Matrix Multiplication!
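To make the "matrix multiplication" point concrete, here is an illustrative sketch (names and the choice of activation are assumptions, not from the slides): the weights of a fully connected layer form a matrix W, so computing every neuron's pre-activation at once is the matrix-vector product Wx plus a bias vector, followed by the activation function.

    #include <math.h>

    /* Forward pass of one fully connected layer with `out` neurons and `in` inputs:
       a = g(W x + b).  W is row-major (out x in); tanh stands in for g purely
       for illustration. */
    void dense_forward(const float *W, const float *b, const float *x,
                       float *a, int out, int in) {
        for (int i = 0; i < out; ++i) {
            float z = b[i];
            for (int j = 0; j < in; ++j)
                z += W[i * in + j] * x[j];   /* row i of W dotted with x */
            a[i] = tanhf(z);                 /* activation function g */
        }
    }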
Hidden Layers
Convolutional Neural Networks
• Leverages the fact that data has spatial structure
– Adds the idea of locality
• Tremendous success with computer vision tasks
• “Put deep learning on the map”
Frobenius Inner Product
X = \begin{bmatrix} 2 & 2 & 1 \\ 2 & 0 & 1 \\ 2 & 1 & 2 \end{bmatrix},
\quad
K = \begin{bmatrix} 1 & 1 & 1 \\ -1 & 1 & -1 \\ -1 & 1 & 1 \end{bmatrix}

\langle X, K \rangle_F = \sum_{i,j} x_{i,j} \, k_{i,j}
= (2)(1) + (2)(1) + (1)(1) + (2)(-1) + (0)(1) + (1)(-1) + (2)(-1) + (1)(1) + (2)(1)
= 2 + 2 + 1 - 2 + 0 - 1 - 2 + 1 + 2
= 3
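A small C sketch of the same computation (illustrative only, not from the slides): the Frobenius inner product is an element-wise multiply-and-accumulate over the image patch and the kernel.

    /* Frobenius inner product of two r x c matrices stored row-major:
       <X, K>_F = sum_{i,j} x_ij * k_ij */
    float frobenius_inner(const float *X, const float *K, int r, int c) {
        float acc = 0.0f;
        for (int idx = 0; idx < r * c; ++idx)
            acc += X[idx] * K[idx];
        return acc;
    }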
Convolution Layer
Andrej Karpathy. CS231n Convolutional Neural Networks for Visual Recognition.
Convolutional Layer
• Input volume X of width w, height h, and depth d; filter K_f with weights k_{i,j,d}; bias b; nonlinearity g(·)
• Each output activation is
g\left( \sum_{d} \sum_{i,j} x_{i,j,d} \, k_{i,j,d} + b \right)
Convolutional Layer
• Convolving each filter with the input image X (w x h x d) produces an activation map
Convolutional Layer
• f = receptive field
(filter size)
• p = padding
• s = stride
• m = number of filters
• Convolution maps an input volume (w_I x h_I x d_I) to an output volume (w_O x h_O x d_O); a small helper computing these sizes is sketched below:
w_O = \frac{w_I - f + 2p}{s} + 1
h_O = \frac{h_I - f + 2p}{s} + 1
d_O = m
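A small helper applying these formulas (an illustrative sketch; the integer division assumes f, p, and s are chosen so the filter tiles the padded input evenly):

    /* Output volume of a convolutional layer with receptive field f, padding p,
       stride s, and m filters, given an input volume of wI x hI x dI. */
    void conv_output_shape(int wI, int hI, int f, int p, int s, int m,
                           int *wO, int *hO, int *dO) {
        *wO = (wI - f + 2 * p) / s + 1;
        *hO = (hI - f + 2 * p) / s + 1;
        *dO = m;   /* one activation map per filter */
    }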
Example: LeNet
Inception ResNet V2
ImageNet
cuDNN
ImageNet Results
http://kaiminghe.com/ilsvrc15/ilsvrc2015_deep_residual_learning_kaiminghe.pdf
ImageNet Entries Using GPUs
https://devblogs.nvidia.com/parallelforall/nvidia-ibm-cloud-support-imagenet-large-scale-visual-recognition-challenge/
Deep Water: Next-Gen Distributed Deep Learning
One Interface - GPU Enabled - Significant Performance Gains
Inherits All H2O Properties in Scalability, Ease of Use and Deployment
Recurrent Neural Networks
enabling natural language
processing, sequences, time series,
and more
Convolutional Neural Networks
enabling image, video, and speech recognition
Hybrid Neural Network Architectures
enabling speech to text translation,
image captioning, scene parsing and
more
H2O integrates with existing GPU
backends for significant performance
gains
H2O Deep Learning Algo
Cat, Dog, or Mouse?
Parallel Computing
Parallel Computing
• Model Parallelism: Split up a single model
Random Forest
T_1(x), T_2(x), T_3(x), \dots, T_B(x)
\hat{y} = f(x; T_1, \dots, T_B)
Random Forest
T_1(x), T_2(x), T_3(x), \dots, T_B(x)
Deep Learning Model Parallelism
Large Scale Distributed Deep Networks. J. Dean, et al.
Parallel Computing
• Model Parallelism: Split up a single model
• Data Parallelism: Split up data to train a single model
Deep Learning Data Parallelism
Large Scale Distributed Deep Networks. J. Dean, et al.
H2O Deep Learning Architecture
Gradient Boosting Machine (GBM)
T_1(x), T_2(x), T_3(x), \dots, T_M(x)
f_M(x) = \sum_{i=1}^{M} T_i(x)
Gradient Boosting Machine (GBM)
T_1(x), T_2(x), T_3(x), \dots, T_M(x)
f_i(x) = f_{i-1}(x) + T_i(x; \hat{\Theta}_i)
Gradient Boosting Machine (GBM)
T_1(x), T_2(x), T_3(x), \dots, T_M(x)
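As an illustrative sketch of what these formulas say (not code from the slides): a boosted model's prediction is the running sum of its trees' predictions, each new tree T_i fit to improve on the model built so far.

    /* f_M(x) = sum_{i=1}^{M} T_i(x): evaluate a boosted ensemble by summing the
       predictions of its M trees.  tree_predict[] stands in for evaluating the
       individual fitted regression trees. */
    float gbm_predict(const float *x, int M,
                      float (*tree_predict[])(const float *x)) {
        float f = 0.0f;
        for (int i = 0; i < M; ++i)
            f += tree_predict[i](x);   /* f_i(x) = f_{i-1}(x) + T_i(x) */
        return f;
    }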
Decision Tree
• Recursively partitions the feature space into regions R_1, R_2, ...
Hastie, Tibshirani, Friedman. Elements of Statistical Learning
GBM Data Parallelism
• Training data with features x_1, x_2, x_3, ..., x_p and response y is split row-wise across K nodes: X = {X_1, ..., X_K}, with node k holding partition X_k
• To grow a tree, we need the best split {X_i; t_i} (which feature i and which threshold t_i)
• Each node computes local statistics math(X_k) on its own partition
• The best split is then found from the combined statistics: {X_i; t_i} = f(math(X_1), ..., math(X_K))
Full Data Parallelism for Each Level of Tree Growth!
CPU Cluster
GPU Cluster
• Leverage GPUs to accelerate processing and exploit fine-grained parallelism on each node
• Each node's GPU computes its local statistics math(X_1), math(X_2), math(X_3), ..., math(X_K)
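The slides do not spell out what math(X_k) is; one common choice in distributed GBMs, and an assumption here rather than something the slides confirm, is a per-bin histogram of counts and response (or gradient) sums over the node's local rows. Histograms from all nodes can then be added together to evaluate candidate splits {X_i; t_i}.

    /* Illustrative "math(X_k)": a histogram over pre-binned values of one feature,
       accumulating the row count and the response sum per bin on this node's
       local partition. */
    typedef struct { int count; double sum_y; } Bin;

    void local_histogram(const int *bin_idx,   /* pre-binned feature values     */
                         const double *y,      /* response (or gradient) values */
                         int n_rows, int n_bins,
                         Bin *hist)            /* output: n_bins entries        */
    {
        for (int b = 0; b < n_bins; ++b) { hist[b].count = 0; hist[b].sum_y = 0.0; }
        for (int r = 0; r < n_rows; ++r) {
            hist[bin_idx[r]].count += 1;
            hist[bin_idx[r]].sum_y += y[r];
        }
    }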
GBM on GPU
T_1(x), T_2(x), T_3(x), \dots, T_M(x)
• As before, the application code is split: the GPU parallelizes the compute-intensive functions, and the CPU runs the rest of the sequential code
Parallel Computing
• Model Parallelism: Split up a single model
• Data Parallelism: Split up data to train a single model
• Training Parallelism: Split up different parts of the training process
– Ensemble Base Learners
– Cross-Validation
– Hyperparameters
Linear Regression
Hastie, Tibshirani, Friedman. Elements of Statistical Learning

X = \begin{bmatrix}
x_{1,1} & x_{1,2} & \dots & x_{1,p} \\
x_{2,1} & x_{2,2} & \dots & x_{2,p} \\
\vdots  & \vdots  & \ddots & \vdots \\
x_{n,1} & x_{n,2} & \dots & x_{n,p}
\end{bmatrix}

\underset{\beta}{\text{minimize}} \; \sum_{i=1}^{n} \left( y_i - \beta_0 - \sum_{j=1}^{p} x_{i,j} \beta_j \right)^2

\underset{\beta}{\text{minimize}} \; \| y - X\beta \|_2^2
Ridge Regression
Hastie, Tibshirani, Wainwright. Statistical Learning with Sparsity

\underset{\beta_0, \beta}{\text{minimize}} \; \| y - X\beta \|_2^2 \quad \text{subject to} \quad \| \beta \|_2^2 \le t

Ridge Regression
Hastie, Tibshirani, Wainwright. Statistical Learning with Sparsity

\underset{\beta}{\text{minimize}} \; \| y - X\beta \|_2^2 + \lambda \| \beta \|_2^2
Lasso Regression
Hastie, Tibshirani, Wainwright. Statistical Learning with Sparsity

\underset{\beta_0, \beta}{\text{minimize}} \; \| y - X\beta \|_2^2 \quad \text{subject to} \quad \| \beta \|_1 \le t

Lasso Regression
Hastie, Tibshirani, Wainwright. Statistical Learning with Sparsity

\underset{\beta}{\text{minimize}} \; \| y - X\beta \|_2^2 + \lambda \| \beta \|_1
Elastic Net Regression
\underset{\beta}{\text{minimize}} \; \| y - X\beta \|_2^2 + \lambda \left( \alpha \| \beta \|_1 + \frac{1}{2} (1 - \alpha) \| \beta \|_2^2 \right)
Single Node GPU
• One node: a CPU host with a single GPU
Single Node Multi-GPU
• One node: a CPU host with multiple GPUs
Elastic Net Regression Training Parallelism
J(\beta; \lambda, \alpha) = \| y - X\beta \|_2^2 + \lambda \left( \alpha \| \beta \|_1 + \frac{1}{2} (1 - \alpha) \| \beta \|_2^2 \right)

GPU_1: \underset{\beta}{\text{min}} \; J(\beta; \lambda_i, \alpha_1) \qquad GPU_2: \underset{\beta}{\text{min}} \; J(\beta; \lambda_i, \alpha_2) \qquad \dots \qquad GPU_G: \underset{\beta}{\text{min}} \; J(\beta; \lambda_i, \alpha_G)

• Each GPU takes one mixing value \alpha_g \in [0, 1] and sweeps over the grid of regularization values \lambda_i
Elastic Net Regression Training Parallelism
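A hedged sketch of this training parallelism using the CUDA runtime's device-selection call (solve_elastic_net is a hypothetical placeholder, not a real API; the point is the dispatch pattern of one alpha value per GPU, each sweeping the lambda grid):

    #include <cuda_runtime.h>

    /* Hypothetical single-GPU elastic net solver; a real implementation would run
       the solver on the currently selected device.  Stubbed out for illustration. */
    static void solve_elastic_net(const float *X, const float *y,
                                  float lambda, float alpha) {
        (void)X; (void)y; (void)lambda; (void)alpha;   /* placeholder */
    }

    /* Training parallelism: GPU g handles alpha[g] over the whole lambda grid. */
    void sweep_elastic_net(const float *X, const float *y,
                           const float *lambdas, int n_lambdas,
                           const float *alphas, int n_gpus) {
        for (int g = 0; g < n_gpus; ++g) {
            cudaSetDevice(g);                          /* route work to GPU g */
            for (int i = 0; i < n_lambdas; ++i)
                solve_elastic_net(X, y, lambdas[i], alphas[g]);
        }
    }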
Getting Started
CUDA
https://developer.nvidia.com/cuda-downloads
cuDNN
https://developer.nvidia.com/cudnn
Deep Water AMI
https://docs.h2o.ai
Others
Multi-GPU Multi-Node Cluster
Others
• Efficient CPU-GPU Utilization
• Communication Link and Overhead
NVLink
Others
• Efficient CPU-GPU Utilization
• Communication Link and Overhead
• Inference
GTC 2017: ML and AI on GPUs
Train Deep Learning Models Using H2O Deep Water
Questions?
