Copyright © 2017 MathWorks, Inc 1
Girish Venkataramani, Avinash Nehemiah
May 2017
Deep Learning and Vision Algorithm
Development in MATLAB Targeting
Embedded GPUs
Copyright © 2017 MathWorks, Inc 2
Design Deep
Learning & Vision
Algorithms
Talk Outline
High Performance
Embedded
Implementation
Highlights
• Manage large image sets
• Automate image labeling
• Easy access to models
• Pre-built training
frameworks
Highlights
• Automate compilation of
MATLAB to CUDA
• 14x faster than pyCaffe
• 60% faster than C++ Caffe
• 3x faster than TensorFlow
Accelerate and Scale
Training
Highlights
• Acceleration with GPUs
• Scale to clusters
Copyright © 2017 MathWorks, Inc 3
Let’s Use Object Detection as an Example
TRUCK
SUV
CAR
In our example we’ll use deep learning for object detection.
Copyright © 2017 MathWorks, Inc 5
Transfer Learning Workflow
Workflow: Images + Labels → Load Reference Network → Modify Network Structure → Learn New Weights → New Classifier
Training data labels: Car, Truck, Large Truck, SUV, Van
Reference networks: AlexNet, VGG-16, VGG-19, GoogLeNet
Copyright © 2017 MathWorks, Inc 6
Manage Large Sets of Images
(Transfer learning workflow diagram)
Handle Large Sets of Images
Easily manage large sets of images
- Single line of code to access images
- Operates on disk, database, big-data file system
imageData = imageDatastore('vehicles')
Organize Images in Folders
(~10,000 images, 5 folders)
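A minimal sketch of that one line in context (the folder layout and split ratio are assumptions for illustration):
% Sketch: 'vehicles' folder assumed to contain one subfolder per class
% (car, truck, largeTruck, suv, van) holding the ~10,000 images.
imds = imageDatastore('vehicles', ...
    'IncludeSubfolders', true, ...
    'LabelSource', 'foldernames');      % folder names become the class labels
countEachLabel(imds)                    % images per class, as a quick sanity check
[imdsTrain, imdsVal] = splitEachLabel(imds, 0.8, 'randomized');   % hold out 20% for validation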
Copyright © 2017 MathWorks, Inc 7
Automate Ground Truth Labeling
(Transfer learning workflow diagram)
Ground Truth Labeling
Copyright © 2017 MathWorks, Inc 8
Automate Ground Truth Labeling
Copyright © 2017 MathWorks, Inc 9
Access Reference Models in MATLAB
(Transfer learning workflow diagram)
Easily Load Reference Networks
Access models with one line of MATLAB code
Net1 = alexnet
Net2 = vgg16
Net3 = vgg19
Copyright © 2017 MathWorks, Inc 10
Access Reference Models in MATLAB
1. Reference Models
2. Model Importer
3. Tutorials
Copyright © 2017 MathWorks, Inc 11
Modify Network Structure
(Transfer learning workflow diagram)
Simple MATLAB API to modify layers:
layers(23) = fullyConnectedLayer(5, 'Name', 'fc8');
layers(25) = classificationLayer('Name', 'VehicleClassifier');
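A slightly fuller sketch of this step (layer indices follow AlexNet's layer array; 5 is the number of vehicle classes):
net = alexnet;                          % pretrained reference network
layers = net.Layers;                    % copy its layer array
% Swap the last learnable layer and the output layer for the 5 vehicle classes.
layers(23) = fullyConnectedLayer(5, 'Name', 'fc8');
layers(25) = classificationLayer('Name', 'VehicleClassifier');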
Copyright © 2017 MathWorks, Inc 12
Training Object Detectors
(Transfer learning workflow diagram)
Train Any Network
trainNetwork(datastore, layers, options)
Pre-built Frameworks for Computer Vision
• Deep Learning: R-CNN, Fast R-CNN, Faster R-CNN
• Machine Learning: ACF, Cascade Object Detectors
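A minimal sketch of the training call, with imdsTrain and layers from the earlier steps (the hyperparameter values here are placeholders, not the ones used in the talk):
opts = trainingOptions('sgdm', ...
    'MiniBatchSize', 32, ...
    'MaxEpochs', 10, ...
    'InitialLearnRate', 1e-4);          % placeholder hyperparameters
vehicleNet = trainNetwork(imdsTrain, layers, opts);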
Copyright © 2017 MathWorks, Inc 13
Visualizing and Debugging Intermediate Results
Filters, layer activations, Deep Dream feature visualization, training accuracy visualization
• Many options for visualizations and debugging
• Examples to get started
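For example, inspecting layer activations and Deep Dream images is one call each (the image file, input size, and layer names are assumptions; exact options vary by release):
img = imresize(imread('testCar.jpg'), [227 227]);                  % AlexNet-style 227x227 RGB input (file name assumed)
act = activations(vehicleNet, img, 'conv1', 'OutputAs', 'channels'); % feature maps from the first convolutional layer
imagesc(act(:,:,1)), axis image, colormap gray                     % look at one channel

dream = deepDreamImage(vehicleNet, 'fc8', 1:5);                    % images that maximally activate the 5 class units
figure, montage(dream)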
Copyright © 2017 MathWorks, Inc 14
Real World Systems Use More Than
Deep Learning
Deep learning vehicle detector performance degrades with environmental effects (fog, etc.)
Fog Removal
Challenge: Deep learning frameworks do not include “classical” computer vision
Solution: Convert MATLAB code with deep learning and computer vision to embedded implementation
Copyright © 2017 MathWorks, Inc 15
Talk Outline
Design Deep
Learning & Vision
Algorithms
High Performance
Embedded
Implementation
Accelerate and Scale
Training
Can you solve “real” problems for production
systems with MATLAB?
Copyright © 2017 MathWorks, Inc 16
Accelerate and Scale Computing
Single code change:
trainingOptions('sgdm', ..., 'ExecutionEnvironment', 'cpu')        % multi-core CPU
trainingOptions('sgdm', ..., 'ExecutionEnvironment', 'gpu')        % GPU
trainingOptions('sgdm', ..., 'ExecutionEnvironment', 'multi-gpu')  % multiple GPUs
trainingOptions('sgdm', ..., 'ExecutionEnvironment', 'parallel')   % cluster / cloud
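So the rest of the training script stays the same; a sketch of the single changed line (other values are placeholders):
opts = trainingOptions('sgdm', ...
    'MiniBatchSize', 64, ...
    'ExecutionEnvironment', 'multi-gpu');   % swap for 'cpu', 'gpu', or 'parallel' as needed
vehicleNet = trainNetwork(imdsTrain, layers, opts);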
Copyright © 2017 MathWorks, Inc 17
After Many Iterations to Find The Best Model
Copyright © 2017 MathWorks, Inc 18
Talk Outline
Design Deep
Learning & Vision
Algorithms
High Performance
Embedded
Implementation
Accelerate and Scale
Training
Can you create a high-performance implementation from MATLAB code?
Copyright © 2017 MathWorks, Inc 19
Presenting the MATLAB to CUDA parallelizing compiler
Why?
• AlexNet inference using the MATLAB solution is
• ~14x faster than pyCaffe and 50% faster than C++ Caffe
• ~4x faster, with ~3x less memory use, than TensorFlow
Copyright © 2017 MathWorks, Inc 20
Sample Generated CUDA Code
(Side by side: MATLAB source code and auto-generated CUDA code)
Copyright © 2017 MathWorks, Inc 21
MATLAB to CUDA compiler flow
Front-end → control-flow graph intermediate representation (CFG-IR) → traditional compiler optimizations → library function mapping → parallel loop creation → CUDA kernel creation → cudaMemcpy minimization → shared memory synthesis → CUDA code emission
• Library function mapping: matrix multiply (×) → cuBLAS GEMM calls; linear solve (\) → cuSolver calls; fft → cuFFT calls; nnet layers → cuDNN calls
• Parallel loop creation: identify the loop nests that will become CUDA kernels
• CUDA kernel creation: convert each loop nest to a CUDA kernel; thread/block dimensions are inferred from the loop dimensions
• cudaMemcpy minimization: perform use-def analysis; cudaMalloc the GPU variables and insert the required memcpys
• Shared memory synthesis: infer data locality; map data to shared memory and synthesize the shared-memory accesses
(Parallel loop creation through shared memory synthesis together form the CUDA kernel optimizations.)
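An illustrative MATLAB loop nest of the kind the parallel-loop-creation step targets (hand-written illustration, not compiler output): every iteration is independent, so the nest can become one CUDA kernel whose thread/block dimensions come from the image size.
function out = brighten(in, gain)
% Element-wise loop nest: one independent iteration per pixel,
% so it maps naturally to one GPU thread per (i, j).
out = zeros(size(in), 'like', in);
for i = 1:size(in, 1)
    for j = 1:size(in, 2)
        out(i, j) = gain * in(i, j);
    end
end
end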
Copyright © 2017 MathWorks, Inc 22
MATLAB to CUDA compiler: Creating large parallel loops!
Front-end → control-flow graph intermediate representation (CFG-IR) → traditional compiler optimizations → loop optimizations → library function mapping → CUDA kernel optimizations (parallel loop creation, CUDA kernel creation, cudaMemcpy minimization, shared memory synthesis) → CUDA code emission
• Loop optimizations: scalarization, loop perfectization, loop interchange, loop fusion, scalar replacement
Copyright © 2017 MathWorks, Inc 23
MATLAB to CUDA compiler: Creating large parallel loops!
Same flow as above; the loop optimizations (scalarization, loop fusion, scalar replacement) are what merge small loops into large parallel ones.
• Example: before loop fusion, 2 kernels (size N) moving 20*N bytes; after fusion, 1 kernel (size N) moving 16*N bytes
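An illustrative pair of element-wise loops showing where such numbers can come from (not the actual example from the talk): unfused, the two loops become two kernels and, with 4-byte single-precision data, move 12*N + 8*N = 20*N bytes through global memory; fused into one kernel, the intermediate C stays in a register between the two statements and only 16*N bytes move.
N = 1e6;
A = rand(N, 1, 'single');  B = rand(N, 1, 'single');
C = zeros(N, 1, 'single'); D = zeros(N, 1, 'single');

% Kernel 1 candidate: reads A and B, writes C  -> 12*N bytes of traffic
for i = 1:N
    C(i) = A(i) + B(i);
end
% Kernel 2 candidate: reads C, writes D        -> 8*N bytes of traffic
for i = 1:N
    D(i) = 2 * C(i);
end
% After loop fusion both statements share one loop (one kernel):
% read A, B and write C, D -> 16*N bytes; C(i) never round-trips through global memory.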
Copyright © 2017 MathWorks, Inc 24
cudaMemcpy minimization
Input (pseudo) code:
A(:) = ….
C(:) = ….
for i = 1:N
   ….
   gB = kernel1(gA);
   gA = kernel2(gB);
   if (some_condition)
      gC = kernel3(gA, gB);
   end
   ….
end
…. = C;
Of the candidate copies, one cudaMemcpy is *definitely* needed, one is *not* needed, and one is *may be* needed.
Observations
• Equivalent to partial redundancy elimination (PRE)
• Dynamic strategy: track the memory location with a status flag per variable
• Use-def analysis determines where to insert the memcpy
Generated (pseudo) code, assuming gA, gB and gC are mapped to GPU memory:
A(:) = …
A_isDirtyOnCpu = true;
…
for i = 1:N
   if (A_isDirtyOnCpu)
      cudaMemcpy(gA, A);
      A_isDirtyOnCpu = false;
   end
   gB = kernel1(gA);
   gA = kernel2(gB);
   if (some_condition)
      gC = kernel3(gA, gB);
      C_isDirtyOnGpu = true;
   end
   …
end
…
if (C_isDirtyOnGpu)
   cudaMemcpy(C, gC);
   C_isDirtyOnGpu = false;
end
… = C;
Copyright © 2017 MathWorks, Inc 25
Example: Compiling fog-rectification algorithm
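One possible shape of such an algorithm in MATLAB (a simple per-channel contrast-stretching sketch; the function name and method are illustrative, not the actual demo code):
function out = fogRectify(in)
% Illustrative defogging: per-channel contrast stretching of an RGB image.
in  = im2double(in);
out = zeros(size(in), 'like', in);
for c = 1:size(in, 3)
    out(:,:,c) = imadjust(in(:,:,c), stretchlim(in(:,:,c)), [0 1]);
end
end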
Copyright © 2017 MathWorks, Inc 26
MATLAB to CUDA compilation of computer vision
applications
Distance
transform
Fog removal
SURF feature
extraction
Ray tracing
Stereo disparity
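A couple of illustrative calls for two of the listed applications (file names are assumptions; the demo implementations may differ):
IL = imread('left.png');  IR = imread('right.png');   % assumed rectified stereo pair
disparityMap = disparity(rgb2gray(IL), rgb2gray(IR)); % dense stereo disparity

BW = imbinarize(rgb2gray(IL));                        % any binary mask
D  = bwdist(~BW);                                     % Euclidean distance transform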
Copyright © 2017 MathWorks, Inc 27
Deep learning prediction performance: Alexnet
[Chart: frame rate (fps) vs. batch size (1, 16, 32, 64); series include Py-Caffe and TensorFlow. CPU: Intel Xeon E5-1650 v3 @ 3.50 GHz; GPU: Tesla K40c.]
Copyright © 2017 MathWorks, Inc 28
Deep learning prediction performance: Alexnet
[Chart: memory usage (GB), CPU resident memory and GPU peak memory (nvidia-smi), vs. batch size (1, 16, 32, 64), comparing Py-Caffe, C++ Caffe, TensorFlow, the MATLAB to CUDA compiler, and MATLAB on CPU+GPU. CPU: Intel Xeon E5-1650 v3 @ 3.50 GHz; GPU: Tesla K40c.]
Copyright © 2017 MathWorks, Inc 29
Deep learning prediction performance: Alexnet
[Chart: frame rate (fps) vs. batch size (1, 16, 32, 64, 128) on the Jetson (Tegra) TX1, comparing C++ Caffe and the MATLAB to CUDA compiler.]
Copyright © 2017 MathWorks, Inc 30
Create CNNs with MATLAB,
Deploy with MATLAB to CUDA compiler
Demos: people detection (AlexNet) and lane detection (YOLO)
Frame rates shown: ~20 fps (K40c), ~30 fps (Tegra X1), ~66 fps (Tegra X1), (K40c)
Copyright © 2017 MathWorks, Inc 31
Conclusions
Design Deep Learning & Vision Algorithms: deep learning design is easy in MATLAB
Accelerate and Scale Training: managing datasets and scaling up training is easy in MATLAB
High Performance Embedded Implementation: MATLAB to CUDA compiler is
• 10x – 14x faster than pyCaffe
• 1.3x – 4x faster than TensorFlow
• 1.07x – 1.6x faster than C++ Caffe
Copyright © 2017 MathWorks, Inc 32
What next?
www.mathworks.com/matlab-cuda-beta
MATLAB to CUDA compiler:
Sign up for our beta program
Try deep learning in MATLAB
Visit our booth and see our demos
Booth #: 808

"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded GPUs," a Presentation from MathWorks

  • 1.
    Copyright © 2017MathWorks, Inc 1 Girish Venkataramani, Avinash Nehemiah May 2017 Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded GPUs
  • 2.
    Copyright © 2017MathWorks, Inc 2 Design Deep Learning & Vision Algorithms Talk Outline High Performance Embedded Implementation Highlights • Manage large image sets • Automate image labeling • Easy access to models • Pre-built training frameworks Highlights • Automate compilation of MATLAB to CUDA • 14x faster than pyCaffe 60% faster than C++ Caffe 3x faster than TensorFlow Accelerate and Scale Training Highlights • Acceleration with GPUs • Scale to clusters
  • 3.
    Copyright © 2017MathWorks, Inc 3 Let’s Use Object Detection as an Example TRUCK SUV CAR In our example we’ll use deep learning for object detection.
  • 4.
    Copyright © 2017MathWorks, Inc 5 Transfer Learning Workflow Transfer Learning Images New Classifier Learn New Weights Modify Network Structure Load Reference NetworkLabels Training Data Labels: Car, Truck, Large Truck, SUV, Van Alexnet, VGG-16, VGG-19, GoogLeNet
  • 5.
    Copyright © 2017MathWorks, Inc 6 Manage Large Sets of Images Transfer Learning Images New Classifier Learn New Weights Modify Network Structure Load Reference NetworkLabels Handle Large Sets of Images Easily manage large sets of images - Single line of code to access images - Operates on disk, database, big-data file system imageData = imageDataStore(‘vehicles’) Easily manage large sets of images - Single line of code to access images - Operates on disk, database, big-data file system Organize Images in Folders (~ 10,000 images , 5 folders)
  • 6.
    Copyright © 2017MathWorks, Inc 7 Automate Ground Truth Labeling Transfer Learning Images New Classifier Learn New Weights Modify Network Structure Load Reference NetworkLabels Ground Truth Labeling
  • 7.
    Copyright © 2017MathWorks, Inc 8 Automate Ground Truth Labeling Automate Ground Truth Labeling
  • 8.
    Copyright © 2017MathWorks, Inc 9 Access Reference Models in MATLAB Transfer Learning Images New Classifier Learn New Weights Modify Network Structure Load Reference NetworkLabels Easily Load Reference Networks Access Models with 1-line of MATLAB Code Net1 = alexnet Net2 = vgg16 Net3 = vgg19
  • 9.
    Copyright © 2017MathWorks, Inc 10 Access Reference Models in MATLAB Easily manage large sets of images - Single line of code to access images - Operates on disk, database, big-data file system 1. Reference Models 2. Model Importer 3. Tutorials
  • 10.
    Copyright © 2017MathWorks, Inc 11 Modify Network Structure Transfer Learning Images New Classifier Learn New Weights Modify Network Structure Load Reference NetworkLabels Simple MATLAB API to modify layers: layers(23) = fullyConnectedLayer(5, 'Name','fc8'); layers(25) = classificationLayer('Name',‘VehicleClassifier')
  • 11.
    Copyright © 2017MathWorks, Inc 12 Training Object Detectors Transfer Learning Images New Classifier Learn New Weights Modify Network Structure Load Reference NetworkLabels Train Any Network trainNetwork(datastore, layers, options) Pre-built Frameworks for Computer Vision • Deep Learning: R-CNN, Fast R-CNN, Faster R-CNN • Machine Learning: ACF, Cascade Object Detectors
  • 12.
    Copyright © 2017MathWorks, Inc 13 Visualizing and Debugging Intermediate Results Filters … Activations Deep Dream Training Accuracy Visualization Deep Dream Layer Activations Feature Visualization • Many options for visualizations and debugging • Examples to get started
  • 13.
    Copyright © 2017MathWorks, Inc 14 Real World Systems Use More Than Deep Learning Deep learning vehicle detector performance degraded with environmental effects (fog, etc. ) Fog Removal Challenge: Deep learning frameworks do not include “classical” computer vision Solution: Convert MATLAB code with deep learning and computer vision to embedded implementation
  • 14.
    Copyright © 2017MathWorks, Inc 15 Talk Outline Design Deep Learning & Vision Algorithms High Performance Embedded Implementation Accelerate and Scale Training Can you solve “real” problems for production systems with MATLAB?
  • 15.
    Copyright © 2017MathWorks, Inc 16 Single code change trainingOptions(‘sgdm’,… ‘ExecutionEnvironment’,’CPU’) Accelerate and Scale Computing Multi-core CPU ‘ExecutionEnvironment’,’GPU’) GPU ‘ExecutionEnvironment’,’multi-GPU’) Multiple GPU ‘ExecutionEnvironment’,’parallel’) Cluster/ Cloud
  • 16.
    Copyright © 2017MathWorks, Inc 17 After Many Iterations to Find The Best Model
  • 17.
    Copyright © 2017MathWorks, Inc 18 Talk Outline Design Deep Learning & Vision Algorithms High Performance Embedded Implementation Accelerate and Scale Training Can you create high performance implementation from MATLAB code ?
  • 18.
    Copyright © 2017MathWorks, Inc 19 Presenting the MATLAB to CUDA parallelizing compiler Why? • Alexnet inference using MATLAB solution is • ~14x faster than pyCaffe and 50% faster than C++-Caffe • ~ 4x faster and ~3x less memory-use than TensorFlow
  • 19.
    Copyright © 2017MathWorks, Inc 20 Sample Generated CUDA Code MATLAB source code Auto-generated CUDA code
  • 20.
    Copyright © 2017MathWorks, Inc 21 MATLAB to CUDA compiler flow Control-flow graph Intermediate representation (CFG – IR) Front-end Parallel loop creation Library function mapping CUDA kernel creation cudaMemcpy minimization Shared memory synthesis CUDA code emission …. Traditional compiler optimizations …. (×) cublas-gemm () cuSolver calls fft cuFFT calls nnet cuDNN calls Library function mapping Parallel loop creation Identify loop-nests that will become CUDA kernels … . CUDA kernel creation Convert loop to CUDA kernel Thread/blocks inferred from loop dims cudaMemcpy minimization Shared memory synthesis Perform Use-def analysis. cudaMalloc GPU vars, insert memcpy Infer data locality. Map to shared memory. Synthesize shared memory access CUDA kernel optimizations
  • 21.
    Copyright © 2017MathWorks, Inc 22 MATLAB to CUDA compiler: Creating large parallel loops! Control-flow graph Intermediate representation (CFG – IR) Front-end Scalarization Loop perfectization Loop interchange Loop fusion Scalar replacement Library function mapping CUDA code emission …. Traditional compiler optimizations … . Loop optimizations Scalarization Loop fusion Scalar replacement Parallel loop creation CUDA kernel creation cudaMemcpy minimization Shared memory synthesis CUDA kernel optimizations
  • 22.
    Copyright © 2017MathWorks, Inc 23 MATLAB to CUDA compiler: Creating large parallel loops! Control-flow graph Intermediate representation (CFG – IR) Front-end Scalarization Loop perfectization Loop interchange Loop fusion Scalar replacement Library function mapping CUDA code emission …. Traditional compiler optimizations … . Loop optimizations 2 kernels (size N), 20*N bytes 1 kernel (size N), 16*N bytes Scalarization Loop fusion Scalar replacement Parallel loop creation CUDA kernel creation cudaMemcpy minimization Shared memory synthesis CUDA kernel optimizations
  • 23.
    Copyright © 2017MathWorks, Inc 24 cudaMemcpy minimization A(:) = …. C(:) = …. for i = 1:N …. gB = kernel1(gA); gA = kernel2(gB); if (some_condition) gC = kernel3(gA, gB); end …. end …. = C; cudaMemcpy *definitely* needed cudaMemcpy *not* needed cudaMemcpy *may be* needed Observations • Equivalent to Partial redundancy elimination (PRE) • Dynamic strategy – track memory location with a status flag per variable • Use-Def to determine where to insert memcpy A(:) = … A_isDirtyOnCpu = true; … for i = 1:N if (A_isDirtyOnCpu) cudaMemcpy(gA, A); A_isDirtyOnCpu = false; end gB = kernel1(gA); gA = kernel2(gB); if (somecondition) gC = kernel3(gA, gB); C_isDirtyOnGpu = true; end … end … if (C_isDirtyOnGpu) cudaMemcpy(C, gC); C_isDirtyOnGpu = false; end … = C; Assume gA, gB and gC are mapped to GPU memory Generated (pseudo) code
  • 24.
    Copyright © 2017MathWorks, Inc 25 Example: Compiling fog-rectification algorithm
  • 25.
    Copyright © 2017MathWorks, Inc 26 MATLAB to CUDA compilation of computer vision applications Distance transform Fog removal SURF feature extraction Ray tracing Stereo disparity
  • 26.
    Copyright © 2017MathWorks, Inc 27 Deep learning prediction performance: Alexnet Framerate(Fps) Batch Size CPU Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50 GHz GPU Tesla K40c 0 200 400 600 800 1000 1200 1400 1 16 32 64 Py-Caffe TensorFlow
  • 27.
    Copyright © 2017MathWorks, Inc 28 Deep learning prediction performance: Alexnet 0 1 2 3 4 5 6 7 8 9 CPU resident memory GPU peak memory (nvidia-smi) Memoryusage(GB) Batch Size 1 16 32 64 CPU Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50 GHz GPU Tesla K40c Py-Caffe MATLABtoCUDAcompiler TensorFlow MATLABonCPU+GPU C++-Caffe
  • 28.
    Copyright © 2017MathWorks, Inc 29 Deep learning prediction performance: Alexnet Jetson (Tegra) TX1 0 50 100 150 200 250 1 16 32 64 128 Framerate(Fps) Batch Size C++-Caffe MATLAB to CUDA compiler
  • 29.
    Copyright © 2017MathWorks, Inc 30 Create CNNs with MATLAB, Deploy with MATLAB to CUDA compiler Alexnet YOLO People detection Lane detection ~20 Fps (K40c) ~30 Fps (Tegra X1) ~66 Fps (Tegra X1) (K40c)
  • 30.
    Copyright © 2017MathWorks, Inc 31 Conclusions Design Deep Learning & Vision Algorithm Accelerate and Scale Training Deep learning design is easy in MATLAB Managing datasets and scaling up training is easy in MATLAB MATLAB to CUDA compiler 10x – 14x faster than pyCaffe 1.3x – 4x faster than TensorFlow 1.07 – 1.6x faster than C++ Caffe High Performance Embedded Implementation
  • 31.
    Copyright © 2017MathWorks, Inc 32 What next? www.mathworks.com/matlab-cuda-beta MATLAB to CUDA compiler: Sign up for our beta program Try deep learning in MATLAB Visit our booth and see our demos Booth #: 808