Copyright © 2017 MathWorks, Inc 1
Girish Venkataramani, Avinash Nehemiah
May 2017
Deep Learning and Vision Algorithm
Development in MATLAB Targeting
Embedded GPUs
Copyright © 2017 MathWorks, Inc 2
Design Deep
Learning & Vision
Algorithms
Talk Outline
High Performance
Embedded
Implementation
Highlights
• Manage large image sets
• Automate image labeling
• Easy access to models
• Pre-built training
frameworks
Highlights
• Automate compilation of
MATLAB to CUDA
• 14x faster than pyCaffe
• 60% faster than C++ Caffe
• 3x faster than TensorFlow
Accelerate and Scale
Training
Highlights
• Acceleration with GPUs
• Scale to clusters
Copyright © 2017 MathWorks, Inc 3
Let’s Use Object Detection as an Example
TRUCK
SUV
CAR
In our example we’ll use deep learning for object detection.
Copyright © 2017 MathWorks, Inc 5
Transfer Learning Workflow
Workflow: Images + Labels → Load Reference Network → Modify Network Structure → Learn New Weights → New Classifier
Training data labels: Car, Truck, Large Truck, SUV, Van
Reference networks: AlexNet, VGG-16, VGG-19, GoogLeNet
Copyright © 2017 MathWorks, Inc 6
Manage Large Sets of Images
(Transfer learning workflow diagram)
Handle Large Sets of Images
Easily manage large sets of images
- Single line of code to access images
- Operates on disk, database, big-data file system
imageData = imageDatastore('vehicles')
Organize Images in Folders
(~10,000 images, 5 folders)
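A minimal sketch of that one line in context (the folder layout and split ratio are assumptions for illustration):
% Sketch: 'vehicles' folder assumed to contain one subfolder per class
% (car, truck, largeTruck, suv, van) holding the ~10,000 images.
imds = imageDatastore('vehicles', ...
    'IncludeSubfolders', true, ...
    'LabelSource', 'foldernames');      % folder names become the class labels
countEachLabel(imds)                    % images per class, as a quick sanity check
[imdsTrain, imdsVal] = splitEachLabel(imds, 0.8, 'randomized');   % hold out 20% for validation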
Copyright © 2017 MathWorks, Inc 7
Automate Ground Truth Labeling
(Transfer learning workflow diagram)
Ground Truth Labeling
Copyright © 2017 MathWorks, Inc 8
Automate Ground Truth Labeling
Copyright © 2017 MathWorks, Inc 9
Access Reference Models in MATLAB
(Transfer learning workflow diagram)
Easily Load Reference Networks
Access models with one line of MATLAB code
Net1 = alexnet
Net2 = vgg16
Net3 = vgg19
Copyright © 2017 MathWorks, Inc 10
Access Reference Models in MATLAB
1. Reference Models
2. Model Importer
3. Tutorials
Copyright © 2017 MathWorks, Inc 11
Modify Network Structure
(Transfer learning workflow diagram)
Simple MATLAB API to modify layers:
layers(23) = fullyConnectedLayer(5, 'Name', 'fc8');
layers(25) = classificationLayer('Name', 'VehicleClassifier');
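A slightly fuller sketch of this step (layer indices follow AlexNet's layer array; 5 is the number of vehicle classes):
net = alexnet;                          % pretrained reference network
layers = net.Layers;                    % copy its layer array
% Swap the last learnable layer and the output layer for the 5 vehicle classes.
layers(23) = fullyConnectedLayer(5, 'Name', 'fc8');
layers(25) = classificationLayer('Name', 'VehicleClassifier');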
Copyright © 2017 MathWorks, Inc 12
Training Object Detectors
(Transfer learning workflow diagram)
Train Any Network
trainNetwork(datastore, layers, options)
Pre-built Frameworks for Computer Vision
• Deep Learning: R-CNN, Fast R-CNN, Faster R-CNN
• Machine Learning: ACF, Cascade Object Detectors
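A minimal sketch of the training call, with imdsTrain and layers from the earlier steps (the hyperparameter values here are placeholders, not the ones used in the talk):
opts = trainingOptions('sgdm', ...
    'MiniBatchSize', 32, ...
    'MaxEpochs', 10, ...
    'InitialLearnRate', 1e-4);          % placeholder hyperparameters
vehicleNet = trainNetwork(imdsTrain, layers, opts);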
Copyright © 2017 MathWorks, Inc 13
Visualizing and Debugging Intermediate Results
Filters, layer activations, Deep Dream feature visualization, training accuracy visualization
• Many options for visualizations and debugging
• Examples to get started
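For example, inspecting layer activations and Deep Dream images is one call each (the image file, input size, and layer names are assumptions; exact options vary by release):
img = imresize(imread('testCar.jpg'), [227 227]);                  % AlexNet-style 227x227 RGB input (file name assumed)
act = activations(vehicleNet, img, 'conv1', 'OutputAs', 'channels'); % feature maps from the first convolutional layer
imagesc(act(:,:,1)), axis image, colormap gray                     % look at one channel

dream = deepDreamImage(vehicleNet, 'fc8', 1:5);                    % images that maximally activate the 5 class units
figure, montage(dream)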
Copyright © 2017 MathWorks, Inc 14
Real World Systems Use More Than
Deep Learning
Deep learning vehicle detector performance degrades with environmental effects (fog, etc.)
Fog Removal
Challenge: Deep learning frameworks do not include “classical” computer vision
Solution: Convert MATLAB code with deep learning and computer vision to embedded implementation
Copyright © 2017 MathWorks, Inc 15
Talk Outline
Design Deep
Learning & Vision
Algorithms
High Performance
Embedded
Implementation
Accelerate and Scale
Training
Can you solve “real” problems for production
systems with MATLAB?
Copyright © 2017 MathWorks, Inc 16
Accelerate and Scale Computing
Single code change:
trainingOptions('sgdm', ..., 'ExecutionEnvironment', 'cpu')        % multi-core CPU
trainingOptions('sgdm', ..., 'ExecutionEnvironment', 'gpu')        % GPU
trainingOptions('sgdm', ..., 'ExecutionEnvironment', 'multi-gpu')  % multiple GPUs
trainingOptions('sgdm', ..., 'ExecutionEnvironment', 'parallel')   % cluster / cloud
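So the rest of the training script stays the same; a sketch of the single changed line (other values are placeholders):
opts = trainingOptions('sgdm', ...
    'MiniBatchSize', 64, ...
    'ExecutionEnvironment', 'multi-gpu');   % swap for 'cpu', 'gpu', or 'parallel' as needed
vehicleNet = trainNetwork(imdsTrain, layers, opts);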
Copyright © 2017 MathWorks, Inc 17
After Many Iterations to Find The Best Model
Copyright © 2017 MathWorks, Inc 18
Talk Outline
Design Deep
Learning & Vision
Algorithms
High Performance
Embedded
Implementation
Accelerate and Scale
Training
Can you create a high-performance implementation from MATLAB code?
Copyright © 2017 MathWorks, Inc 19
Presenting the MATLAB to CUDA parallelizing compiler
Why?
• AlexNet inference using the MATLAB solution is
• ~14x faster than pyCaffe and 50% faster than C++ Caffe
• ~4x faster, with ~3x less memory use, than TensorFlow
Copyright © 2017 MathWorks, Inc 20
Sample Generated CUDA Code
(Side by side: MATLAB source code and auto-generated CUDA code)
Copyright © 2017 MathWorks, Inc 21
MATLAB to CUDA compiler flow
Front-end → control-flow graph intermediate representation (CFG-IR) → traditional compiler optimizations → library function mapping → parallel loop creation → CUDA kernel creation → cudaMemcpy minimization → shared memory synthesis → CUDA code emission
• Library function mapping: matrix multiply (×) → cuBLAS GEMM calls; linear solve (\) → cuSolver calls; fft → cuFFT calls; nnet layers → cuDNN calls
• Parallel loop creation: identify the loop nests that will become CUDA kernels
• CUDA kernel creation: convert each loop nest to a CUDA kernel; thread/block dimensions are inferred from the loop dimensions
• cudaMemcpy minimization: perform use-def analysis; cudaMalloc the GPU variables and insert the required memcpys
• Shared memory synthesis: infer data locality; map data to shared memory and synthesize the shared-memory accesses
(Parallel loop creation through shared memory synthesis together form the CUDA kernel optimizations.)
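An illustrative MATLAB loop nest of the kind the parallel-loop-creation step targets (hand-written illustration, not compiler output): every iteration is independent, so the nest can become one CUDA kernel whose thread/block dimensions come from the image size.
function out = brighten(in, gain)
% Element-wise loop nest: one independent iteration per pixel,
% so it maps naturally to one GPU thread per (i, j).
out = zeros(size(in), 'like', in);
for i = 1:size(in, 1)
    for j = 1:size(in, 2)
        out(i, j) = gain * in(i, j);
    end
end
end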
Copyright © 2017 MathWorks, Inc 22
MATLAB to CUDA compiler: Creating large parallel loops!
Front-end → control-flow graph intermediate representation (CFG-IR) → traditional compiler optimizations → loop optimizations → library function mapping → CUDA kernel optimizations (parallel loop creation, CUDA kernel creation, cudaMemcpy minimization, shared memory synthesis) → CUDA code emission
• Loop optimizations: scalarization, loop perfectization, loop interchange, loop fusion, scalar replacement
Copyright © 2017 MathWorks, Inc 23
MATLAB to CUDA compiler: Creating large parallel loops!
Same flow as above; the loop optimizations (scalarization, loop fusion, scalar replacement) are what merge small loops into large parallel ones.
• Example: before loop fusion, 2 kernels (size N) moving 20*N bytes; after fusion, 1 kernel (size N) moving 16*N bytes
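An illustrative pair of element-wise loops showing where such numbers can come from (not the actual example from the talk): unfused, the two loops become two kernels and, with 4-byte single-precision data, move 12*N + 8*N = 20*N bytes through global memory; fused into one kernel, the intermediate C stays in a register between the two statements and only 16*N bytes move.
N = 1e6;
A = rand(N, 1, 'single');  B = rand(N, 1, 'single');
C = zeros(N, 1, 'single'); D = zeros(N, 1, 'single');

% Kernel 1 candidate: reads A and B, writes C  -> 12*N bytes of traffic
for i = 1:N
    C(i) = A(i) + B(i);
end
% Kernel 2 candidate: reads C, writes D        -> 8*N bytes of traffic
for i = 1:N
    D(i) = 2 * C(i);
end
% After loop fusion both statements share one loop (one kernel):
% read A, B and write C, D -> 16*N bytes; C(i) never round-trips through global memory.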
Copyright © 2017 MathWorks, Inc 24
cudaMemcpy minimization
Input (pseudo) code:
A(:) = ….
C(:) = ….
for i = 1:N
   ….
   gB = kernel1(gA);
   gA = kernel2(gB);
   if (some_condition)
      gC = kernel3(gA, gB);
   end
   ….
end
…. = C;
Of the candidate copies, one cudaMemcpy is *definitely* needed, one is *not* needed, and one is *may be* needed.
Observations
• Equivalent to partial redundancy elimination (PRE)
• Dynamic strategy: track the memory location with a status flag per variable
• Use-def analysis determines where to insert the memcpy
Generated (pseudo) code, assuming gA, gB and gC are mapped to GPU memory:
A(:) = …
A_isDirtyOnCpu = true;
…
for i = 1:N
   if (A_isDirtyOnCpu)
      cudaMemcpy(gA, A);
      A_isDirtyOnCpu = false;
   end
   gB = kernel1(gA);
   gA = kernel2(gB);
   if (some_condition)
      gC = kernel3(gA, gB);
      C_isDirtyOnGpu = true;
   end
   …
end
…
if (C_isDirtyOnGpu)
   cudaMemcpy(C, gC);
   C_isDirtyOnGpu = false;
end
… = C;
Copyright © 2017 MathWorks, Inc 25
Example: Compiling fog-rectification algorithm
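One possible shape of such an algorithm in MATLAB (a simple per-channel contrast-stretching sketch; the function name and method are illustrative, not the actual demo code):
function out = fogRectify(in)
% Illustrative defogging: per-channel contrast stretching of an RGB image.
in  = im2double(in);
out = zeros(size(in), 'like', in);
for c = 1:size(in, 3)
    out(:,:,c) = imadjust(in(:,:,c), stretchlim(in(:,:,c)), [0 1]);
end
end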
Copyright © 2017 MathWorks, Inc 26
MATLAB to CUDA compilation of computer vision
applications
Distance
transform
Fog removal
SURF feature
extraction
Ray tracing
Stereo disparity
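A couple of illustrative calls for two of the listed applications (file names are assumptions; the demo implementations may differ):
IL = imread('left.png');  IR = imread('right.png');   % assumed rectified stereo pair
disparityMap = disparity(rgb2gray(IL), rgb2gray(IR)); % dense stereo disparity

BW = imbinarize(rgb2gray(IL));                        % any binary mask
D  = bwdist(~BW);                                     % Euclidean distance transform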
Copyright © 2017 MathWorks, Inc 27
Deep learning prediction performance: Alexnet
[Chart: frame rate (fps) vs. batch size (1, 16, 32, 64); series include Py-Caffe and TensorFlow. CPU: Intel Xeon E5-1650 v3 @ 3.50 GHz; GPU: Tesla K40c.]
Copyright © 2017 MathWorks, Inc 28
Deep learning prediction performance: Alexnet
[Chart: memory usage (GB), CPU resident memory and GPU peak memory (nvidia-smi), vs. batch size (1, 16, 32, 64), comparing Py-Caffe, C++ Caffe, TensorFlow, the MATLAB to CUDA compiler, and MATLAB on CPU+GPU. CPU: Intel Xeon E5-1650 v3 @ 3.50 GHz; GPU: Tesla K40c.]
Copyright © 2017 MathWorks, Inc 29
Deep learning prediction performance: Alexnet
[Chart: frame rate (fps) vs. batch size (1, 16, 32, 64, 128) on the Jetson (Tegra) TX1, comparing C++ Caffe and the MATLAB to CUDA compiler.]
Copyright © 2017 MathWorks, Inc 30
Create CNNs with MATLAB,
Deploy with MATLAB to CUDA compiler
Demos: people detection (AlexNet) and lane detection (YOLO)
Frame rates shown: ~20 fps (K40c), ~30 fps (Tegra X1), ~66 fps (Tegra X1), (K40c)
Copyright © 2017 MathWorks, Inc 31
Conclusions
Design Deep Learning & Vision Algorithms: deep learning design is easy in MATLAB
Accelerate and Scale Training: managing datasets and scaling up training is easy in MATLAB
High Performance Embedded Implementation: MATLAB to CUDA compiler is
• 10x – 14x faster than pyCaffe
• 1.3x – 4x faster than TensorFlow
• 1.07x – 1.6x faster than C++ Caffe
Copyright © 2017 MathWorks, Inc 32
What next?
www.mathworks.com/matlab-cuda-beta
MATLAB to CUDA compiler:
Sign up for our beta program
Try deep learning in MATLAB
Visit our booth and see our demos
Booth #: 808

"Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded GPUs," a Presentation from MathWorks

  • 1.
    Copyright © 2017MathWorks, Inc 1 Girish Venkataramani, Avinash Nehemiah May 2017 Deep Learning and Vision Algorithm Development in MATLAB Targeting Embedded GPUs
  • 2.
    Copyright © 2017MathWorks, Inc 2 Design Deep Learning & Vision Algorithms Talk Outline High Performance Embedded Implementation Highlights • Manage large image sets • Automate image labeling • Easy access to models • Pre-built training frameworks Highlights • Automate compilation of MATLAB to CUDA • 14x faster than pyCaffe 60% faster than C++ Caffe 3x faster than TensorFlow Accelerate and Scale Training Highlights • Acceleration with GPUs • Scale to clusters
  • 3.
    Copyright © 2017MathWorks, Inc 3 Let’s Use Object Detection as an Example TRUCK SUV CAR In our example we’ll use deep learning for object detection.
  • 4.
    Copyright © 2017MathWorks, Inc 5 Transfer Learning Workflow Transfer Learning Images New Classifier Learn New Weights Modify Network Structure Load Reference NetworkLabels Training Data Labels: Car, Truck, Large Truck, SUV, Van Alexnet, VGG-16, VGG-19, GoogLeNet
  • 5.
    Copyright © 2017MathWorks, Inc 6 Manage Large Sets of Images Transfer Learning Images New Classifier Learn New Weights Modify Network Structure Load Reference NetworkLabels Handle Large Sets of Images Easily manage large sets of images - Single line of code to access images - Operates on disk, database, big-data file system imageData = imageDataStore(‘vehicles’) Easily manage large sets of images - Single line of code to access images - Operates on disk, database, big-data file system Organize Images in Folders (~ 10,000 images , 5 folders)
  • 6.
    Copyright © 2017MathWorks, Inc 7 Automate Ground Truth Labeling Transfer Learning Images New Classifier Learn New Weights Modify Network Structure Load Reference NetworkLabels Ground Truth Labeling
  • 7.
    Copyright © 2017MathWorks, Inc 8 Automate Ground Truth Labeling Automate Ground Truth Labeling
  • 8.
    Copyright © 2017MathWorks, Inc 9 Access Reference Models in MATLAB Transfer Learning Images New Classifier Learn New Weights Modify Network Structure Load Reference NetworkLabels Easily Load Reference Networks Access Models with 1-line of MATLAB Code Net1 = alexnet Net2 = vgg16 Net3 = vgg19
  • 9.
    Copyright © 2017MathWorks, Inc 10 Access Reference Models in MATLAB Easily manage large sets of images - Single line of code to access images - Operates on disk, database, big-data file system 1. Reference Models 2. Model Importer 3. Tutorials
  • 10.
    Copyright © 2017MathWorks, Inc 11 Modify Network Structure Transfer Learning Images New Classifier Learn New Weights Modify Network Structure Load Reference NetworkLabels Simple MATLAB API to modify layers: layers(23) = fullyConnectedLayer(5, 'Name','fc8'); layers(25) = classificationLayer('Name',‘VehicleClassifier')
  • 11.
    Copyright © 2017MathWorks, Inc 12 Training Object Detectors Transfer Learning Images New Classifier Learn New Weights Modify Network Structure Load Reference NetworkLabels Train Any Network trainNetwork(datastore, layers, options) Pre-built Frameworks for Computer Vision • Deep Learning: R-CNN, Fast R-CNN, Faster R-CNN • Machine Learning: ACF, Cascade Object Detectors
  • 12.
    Copyright © 2017MathWorks, Inc 13 Visualizing and Debugging Intermediate Results Filters … Activations Deep Dream Training Accuracy Visualization Deep Dream Layer Activations Feature Visualization • Many options for visualizations and debugging • Examples to get started
  • 13.
    Copyright © 2017MathWorks, Inc 14 Real World Systems Use More Than Deep Learning Deep learning vehicle detector performance degraded with environmental effects (fog, etc. ) Fog Removal Challenge: Deep learning frameworks do not include “classical” computer vision Solution: Convert MATLAB code with deep learning and computer vision to embedded implementation
  • 14.
    Copyright © 2017MathWorks, Inc 15 Talk Outline Design Deep Learning & Vision Algorithms High Performance Embedded Implementation Accelerate and Scale Training Can you solve “real” problems for production systems with MATLAB?
  • 15.
    Copyright © 2017MathWorks, Inc 16 Single code change trainingOptions(‘sgdm’,… ‘ExecutionEnvironment’,’CPU’) Accelerate and Scale Computing Multi-core CPU ‘ExecutionEnvironment’,’GPU’) GPU ‘ExecutionEnvironment’,’multi-GPU’) Multiple GPU ‘ExecutionEnvironment’,’parallel’) Cluster/ Cloud
  • 16.
    Copyright © 2017MathWorks, Inc 17 After Many Iterations to Find The Best Model
  • 17.
    Copyright © 2017MathWorks, Inc 18 Talk Outline Design Deep Learning & Vision Algorithms High Performance Embedded Implementation Accelerate and Scale Training Can you create high performance implementation from MATLAB code ?
  • 18.
    Copyright © 2017MathWorks, Inc 19 Presenting the MATLAB to CUDA parallelizing compiler Why? • Alexnet inference using MATLAB solution is • ~14x faster than pyCaffe and 50% faster than C++-Caffe • ~ 4x faster and ~3x less memory-use than TensorFlow
  • 19.
    Copyright © 2017MathWorks, Inc 20 Sample Generated CUDA Code MATLAB source code Auto-generated CUDA code
  • 20.
    Copyright © 2017MathWorks, Inc 21 MATLAB to CUDA compiler flow Control-flow graph Intermediate representation (CFG – IR) Front-end Parallel loop creation Library function mapping CUDA kernel creation cudaMemcpy minimization Shared memory synthesis CUDA code emission …. Traditional compiler optimizations …. (×) cublas-gemm () cuSolver calls fft cuFFT calls nnet cuDNN calls Library function mapping Parallel loop creation Identify loop-nests that will become CUDA kernels … . CUDA kernel creation Convert loop to CUDA kernel Thread/blocks inferred from loop dims cudaMemcpy minimization Shared memory synthesis Perform Use-def analysis. cudaMalloc GPU vars, insert memcpy Infer data locality. Map to shared memory. Synthesize shared memory access CUDA kernel optimizations
  • 21.
    Copyright © 2017MathWorks, Inc 22 MATLAB to CUDA compiler: Creating large parallel loops! Control-flow graph Intermediate representation (CFG – IR) Front-end Scalarization Loop perfectization Loop interchange Loop fusion Scalar replacement Library function mapping CUDA code emission …. Traditional compiler optimizations … . Loop optimizations Scalarization Loop fusion Scalar replacement Parallel loop creation CUDA kernel creation cudaMemcpy minimization Shared memory synthesis CUDA kernel optimizations
  • 22.
    Copyright © 2017MathWorks, Inc 23 MATLAB to CUDA compiler: Creating large parallel loops! Control-flow graph Intermediate representation (CFG – IR) Front-end Scalarization Loop perfectization Loop interchange Loop fusion Scalar replacement Library function mapping CUDA code emission …. Traditional compiler optimizations … . Loop optimizations 2 kernels (size N), 20*N bytes 1 kernel (size N), 16*N bytes Scalarization Loop fusion Scalar replacement Parallel loop creation CUDA kernel creation cudaMemcpy minimization Shared memory synthesis CUDA kernel optimizations
  • 23.
    Copyright © 2017MathWorks, Inc 24 cudaMemcpy minimization A(:) = …. C(:) = …. for i = 1:N …. gB = kernel1(gA); gA = kernel2(gB); if (some_condition) gC = kernel3(gA, gB); end …. end …. = C; cudaMemcpy *definitely* needed cudaMemcpy *not* needed cudaMemcpy *may be* needed Observations • Equivalent to Partial redundancy elimination (PRE) • Dynamic strategy – track memory location with a status flag per variable • Use-Def to determine where to insert memcpy A(:) = … A_isDirtyOnCpu = true; … for i = 1:N if (A_isDirtyOnCpu) cudaMemcpy(gA, A); A_isDirtyOnCpu = false; end gB = kernel1(gA); gA = kernel2(gB); if (somecondition) gC = kernel3(gA, gB); C_isDirtyOnGpu = true; end … end … if (C_isDirtyOnGpu) cudaMemcpy(C, gC); C_isDirtyOnGpu = false; end … = C; Assume gA, gB and gC are mapped to GPU memory Generated (pseudo) code
  • 24.
    Copyright © 2017MathWorks, Inc 25 Example: Compiling fog-rectification algorithm
  • 25.
    Copyright © 2017MathWorks, Inc 26 MATLAB to CUDA compilation of computer vision applications Distance transform Fog removal SURF feature extraction Ray tracing Stereo disparity
  • 26.
    Copyright © 2017MathWorks, Inc 27 Deep learning prediction performance: Alexnet Framerate(Fps) Batch Size CPU Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50 GHz GPU Tesla K40c 0 200 400 600 800 1000 1200 1400 1 16 32 64 Py-Caffe TensorFlow
  • 27.
    Copyright © 2017MathWorks, Inc 28 Deep learning prediction performance: Alexnet 0 1 2 3 4 5 6 7 8 9 CPU resident memory GPU peak memory (nvidia-smi) Memoryusage(GB) Batch Size 1 16 32 64 CPU Intel(R) Xeon(R) CPU E5-1650 v3 @ 3.50 GHz GPU Tesla K40c Py-Caffe MATLABtoCUDAcompiler TensorFlow MATLABonCPU+GPU C++-Caffe
  • 28.
    Copyright © 2017MathWorks, Inc 29 Deep learning prediction performance: Alexnet Jetson (Tegra) TX1 0 50 100 150 200 250 1 16 32 64 128 Framerate(Fps) Batch Size C++-Caffe MATLAB to CUDA compiler
  • 29.
    Copyright © 2017MathWorks, Inc 30 Create CNNs with MATLAB, Deploy with MATLAB to CUDA compiler Alexnet YOLO People detection Lane detection ~20 Fps (K40c) ~30 Fps (Tegra X1) ~66 Fps (Tegra X1) (K40c)
  • 30.
    Copyright © 2017MathWorks, Inc 31 Conclusions Design Deep Learning & Vision Algorithm Accelerate and Scale Training Deep learning design is easy in MATLAB Managing datasets and scaling up training is easy in MATLAB MATLAB to CUDA compiler 10x – 14x faster than pyCaffe 1.3x – 4x faster than TensorFlow 1.07 – 1.6x faster than C++ Caffe High Performance Embedded Implementation
  • 31.
    Copyright © 2017MathWorks, Inc 32 What next? www.mathworks.com/matlab-cuda-beta MATLAB to CUDA compiler: Sign up for our beta program Try deep learning in MATLAB Visit our booth and see our demos Booth #: 808