"Deploying Deep Learning Models on Embedded Processors for Autonomous Systems with MATLAB," a Presentation from MathWorks

© 2019 MathWorks, Inc.
Deploying Deep Learning Models on Embedded Processors for Autonomous Systems with MATLAB
Bill Chou, Sandeep Hiremath
MathWorks
May 2019

Autonomous Systems

Autonomous Systems
Control
Planning
Perception

Deep Learning for Perception in Autonomous Systems
Control: Sensor models & model predictive control
Planning: Path planning
Perception: Deep learning, Sensor fusion

Deep Learning in Automated Driving

Outline
Ground Truth Labeling
Network Design and Training
C/C++ and CUDA Code Generation
Hardware Targeting (CPUs and GPUs)
Key Takeaways
• Platform Productivity
• Framework Interoperability
Key Takeaways
• Optimized C/C++ and CUDA
• Hardware Targeting
• Processor-in-the-Loop (PIL) Testing

Example Used in Today’s Talk
Perception in Autonomous Application
Input
Lane Detection
Coordinate Transform
Object Detection
Bounding Box Processing
Output

Outline
Ground Truth Labeling
Network Design and Training
C/C++ and CUDA Code Generation
Hardware Targeting (CPUs and GPUs)

Ground Truth Labeling App

Automate Labeling
Lane Markers
Vehicle Bounding Boxes

Deep Learning Models
Input
Lane Detection
Coordinate Transform
Object Detection
Bounding Box Processing
Output

Importing Pre-trained Models
>> net = alexnet
OR
Import pre-trained networks (AlexNet, ResNet-50)
Modify network layers
Re-train network with training data
Detector object

Interactive Network Design
Import pre-trained networks (AlexNet, ResNet-50)
Re-train network with training data
Detector object

Accelerated Training
Import pre-trained networks (AlexNet, ResNet-50)
Re-train network with training data
Evaluate trained network
Single CPU
Single CPU, Single GPU
Single CPU, Multiple GPUs
Cloud GPUs

Network Evaluation
Import pre-trained networks (AlexNet, ResNet-50)
Re-train network with training data
Evaluate trained network

Lane and Object Detectors Running in MATLAB

Outline
Ground Truth Labeling
Network Design and Training
C/C++ and CUDA Code Generation
Hardware Targeting (CPUs and GPUs)

Input
Lane Detection
Coordinate Transform
Object Detection
Bounding Box Processing
Output

Logic Logic
Input Output

Multi-Platform Deep Learning Deployment
Data Center
Workstation
NVIDIA Jetson
NVIDIA DRIVE
Raspberry Pi

Multi-Platform Deep Learning Deployment
GPU Coder: NVIDIA GPUs
MATLAB Coder: Intel CPUs, ARM Cortex-A CPUs

Generate Code from Non-Deep Learning Parts
Generate Optimized CUDA/C++ Code
Input
Lane Detection
Coordinate Transform
Object Detection
Bounding Box Processing
Output

2200+ Functions for C/C++, 380+ Functions for CUDA
Core Math
Comm. Toolbox
DSP System Toolbox
Image Processing Toolbox
Computer Vision Toolbox
Signal Processing Toolbox
Sensor Fusion and Tracking Toolbox
Wavelet Toolbox
WLAN Toolbox
Phased Array System Toolbox
Statistics & Machine Learning Toolbox
Fixed-Point Designer
Automated Driving Toolbox
Robotics System Toolbox
5G Toolbox

Mapped to Optimization Libraries
GPU Coder: NVIDIA GPUs
MATLAB Coder: Intel CPUs, ARM Cortex-A CPUs
Libraries: cuBLAS, cuFFT, cuSolver, Thrust, cuDNN, TensorRT, MKL-DNN, FFTW, BLAS, ARM Compute Library, OpenCV

GPUs: Automatically Extract Parallelism from MATLAB
1. Scalarized MATLAB (“for-all” loops)
2. Vectorized MATLAB (math operators and library functions)
3. Composite functions in MATLAB (maps to cuBLAS, cuFFT, cuSolver, cuDNN, TensorRT)
Infer CUDA kernels from MATLAB loops
Library replacement

GPU Coder Compiler Transforms & Optimizations
MATLAB
Front End
Control-Flow Graph Intermediate Representation
Traditional Compiler Optimizations
Library Function Mapping
Parallel Loop Creation
CUDA Kernel Creation
cudaMemcpy Minimization
Shared Memory Mapping
CUDA Code Emission
Loop Optimizations: Scalarization, Loop Perfectization, Loop Interchange, Loop Fusion, Scalar Replacement
CUDA Kernel Lowering
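
Loop fusion and scalar replacement, two of the loop optimizations this slide names, can be illustrated outside of any compiler. The sketch below is hand-written Python, not GPU Coder output, and the function names are hypothetical. It shows two loops over the same index range being merged into one, with the intermediate array replaced by a scalar. Every iteration of the fused loop is independent of the others, which is the property a "for-all" loop needs before it can be mapped to parallel threads.

```python
def scale_then_offset_unfused(x, a, b):
    """Two separate loops; the first materializes a temporary array."""
    tmp = [0.0] * len(x)
    for i in range(len(x)):      # loop 1: scale
        tmp[i] = a * x[i]
    y = [0.0] * len(x)
    for i in range(len(x)):      # loop 2: offset
        y[i] = tmp[i] + b
    return y


def scale_then_offset_fused(x, a, b):
    """One fused loop; the temporary array becomes a scalar."""
    y = [0.0] * len(x)
    for i in range(len(x)):
        t = a * x[i]             # scalar replacement of tmp[i]
        y[i] = t + b
    return y


x = [1.0, 2.5, -3.0]
unfused = scale_then_offset_unfused(x, 2.0, 0.5)
fused = scale_then_offset_fused(x, 2.0, 0.5)
```

Both versions perform the same arithmetic in the same order, so the results are identical; the fused form simply avoids one full pass over memory.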

Generate Code from Deep Learning Networks
Generate Optimized Inference Code
Deep Learning Network Optimizations: Layer Fusion, Memory Optimization, Network Re-architecture
Input
Lane Detection
Coordinate Transform
Object Detection
Bounding Box Processing
Output
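
The slides do not spell out how layers are fused, so the following is only a sketch of one widely used form of layer fusion: folding an inference-time batch normalization into the preceding convolution. It uses a single-channel 1-D convolution in plain Python to keep the algebra visible; all function and variable names are hypothetical.

```python
import math


def conv1d(x, w, bias):
    """Valid 1-D convolution (cross-correlation form), single channel."""
    n = len(x) - len(w) + 1
    return [sum(w[k] * x[i + k] for k in range(len(w))) + bias for i in range(n)]


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization with fixed statistics."""
    s = gamma / math.sqrt(var + eps)
    return [s * (v - mean) + beta for v in y]


def fold_batch_norm(w, bias, gamma, beta, mean, var, eps=1e-5):
    """Return conv weights and bias that reproduce conv followed by batch norm."""
    s = gamma / math.sqrt(var + eps)
    return [s * wk for wk in w], s * (bias - mean) + beta


x = [0.5, -1.0, 2.0, 3.5, -0.25]
w, bias = [0.2, -0.4, 0.1], 0.3
bn = dict(gamma=1.5, beta=-0.2, mean=0.1, var=0.8)

unfused = batch_norm(conv1d(x, w, bias), **bn)   # two layers, two passes
w_f, b_f = fold_batch_norm(w, bias, **bn)
fused = conv1d(x, w_f, b_f)                      # one layer, one pass
```

The fused convolution matches the two-layer result up to floating-point rounding, while removing one layer's worth of computation and one intermediate buffer.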

Deep Learning Network Optimizations
Original Network: Conv, Batch Norm, ReLU, Add, Conv, ReLU, Max Pool, Max Pool
Layer Fusion (Optimized Computation): Fused Conv, Fused Conv BatchNormAdd, Max Pool, Max Pool
Buffer Minimization (Optimized Memory): Buffer A, Buffer B, Buffer C, Buffer D, Buffer E; two buffers are crossed out (X) and replaced by reusing Buffer A and Buffer B
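
Buffer minimization of the kind pictured here can be sketched as a liveness-based assignment: once a layer's output has been read for the last time, its buffer is free to hold a later layer's output. The Python below is a minimal illustration under that assumption and is not the algorithm GPU Coder uses. The layer names and the `assign_buffers` helper are hypothetical, and the toy chain is not the network drawn on the slide.

```python
def assign_buffers(layers):
    """Greedily assign output buffers, reusing any whose contents are dead.

    layers: list of (name, input_names) in execution order.
    Returns {layer name: buffer id}.
    """
    # Record the last step at which each layer's output is read.
    last_use = {}
    for step, (name, inputs) in enumerate(layers):
        for src in inputs:
            last_use[src] = step

    free, assignment, next_id = [], {}, 0
    for step, (name, inputs) in enumerate(layers):
        # Choose the output buffer first so a layer never overwrites its own input.
        if free:
            buf = free.pop(0)
        else:
            buf, next_id = next_id, next_id + 1
        assignment[name] = buf
        # Inputs read for the last time at this step can be recycled afterwards.
        for src in set(inputs):
            if last_use[src] == step and src in assignment:
                free.append(assignment[src])
    return assignment


chain = [
    ("conv1", ["input"]),
    ("pool1", ["conv1"]),
    ("conv2", ["pool1"]),
    ("pool2", ["conv2"]),
    ("fc", ["pool2"]),
]
plan = assign_buffers(chain)
```

For this five-layer chain the plan alternates between two buffers instead of allocating five. Networks with skip connections keep some outputs alive longer and therefore need more buffers.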

Supported Pretrained Networks
SegNet
ResNet-50
VGG-19
Inception-v3
SqueezeNet
VGG-16
AlexNet
GoogLeNet
ResNet-101

Optimized Deep Learning Libraries & Runtimes
GPU Coder: NVIDIA GPUs (cuDNN, TensorRT)
MATLAB Coder: Intel CPUs (MKL-DNN), ARM Cortex-A CPUs (ARM Compute Library)

GPU Coder: NVIDIA GPUs (cuDNN, TensorRT)
MATLAB Coder: Intel CPUs (MKL-DNN), ARM Cortex-A CPUs (ARM Compute Library)
Semantic Segmentation
Defective Product Detection
Blood Smear Segmentation
Blood Smear Segmentation

Generating CUDA Code and Running It on a Titan V GPU

How Is the Performance?

Single Image Inference on Titan V using cuDNN
PyTorch (1.0.0)
MXNet (1.4.0)
GPU Coder (R2019a)
TensorFlow (1.13.0)
Intel® Xeon® CPU 3.6 GHz - NVIDIA libraries: CUDA 10, cuDNN 7 - Frameworks: TensorFlow 1.13.0, MXNet 1.4.0, PyTorch 1.0.0

TensorRT Accelerates Inference on Titan V
Single Image Inference with ResNet-50 (Titan V)
cuDNN, TensorRT (FP32), TensorRT (INT8)
GPU Coder
TensorFlow

Single Image Inference on CPU
MATLAB
TensorFlow
MXNet
MATLAB Coder
PyTorch
CPU, Single Image Inference (Linux)
Intel® Xeon® CPU 3.6 GHz - Frameworks: TensorFlow 1.6.0, MXNet 1.2.1, PyTorch 0.3.1

Outline
Ground Truth Labeling
Network Design and Training
C/C++ and CUDA Code Generation
Hardware Targeting (CPUs and GPUs)

Access Target Peripherals from MATLAB
Jetson AGX Xavier
Host Machine
DRIVE AGX
Raspberry Pi
Peripheral Data

Deploy Application to Target Boards
Host Machine
Generated CUDA Code
Generated C/C++ Code
Jetson AGX Xavier
DRIVE AGX
Raspberry Pi

Deploy Application to Jetson AGX Xavier
Host Machine
Deploy Generated CUDA Code
Jetson AGX Xavier
Target Display
Video Feed

Processor-in-the-Loop (PIL) Testing on Hardware Boards
Host Machine
Deploy Generated CUDA Code
Jetson AGX Xavier
Send Inputs & Compare Results
Data Exchange
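
The "send inputs and compare results" step of a PIL test comes down to running the same inputs through a reference implementation and the deployed code, then checking that the outputs agree within a tolerance. The Python below is a host-only sketch of that comparison. `reference_model` and `target_model` are hypothetical stand-ins, and the target is simulated by rounding to single precision rather than by talking to a board.

```python
import math
import struct


def to_float32(v):
    """Round a Python float to the nearest IEEE-754 single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]


def reference_model(x):
    """Stand-in for the algorithm running on the host in double precision."""
    return [math.tanh(0.5 * v) for v in x]


def target_model(x):
    """Stand-in for generated code on the board; here it only rounds to single precision."""
    return [to_float32(math.tanh(0.5 * to_float32(v))) for v in x]


def compare(reference, measured, rel_tol=1e-5, abs_tol=1e-6):
    """Return the indices where the two outputs disagree beyond tolerance."""
    return [i for i, (r, m) in enumerate(zip(reference, measured))
            if not math.isclose(r, m, rel_tol=rel_tol, abs_tol=abs_tol)]


inputs = [-2.0, -0.5, 0.0, 0.1, 1.0, 3.0]
mismatches = compare(reference_model(inputs), target_model(inputs))
```

An empty `mismatches` list means the two implementations agree within the chosen tolerances. In an actual PIL run, the call to `target_model` would be replaced by the data exchange with the hardware board.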

Musashi Seimitsu Industry Co., Ltd.
Detect Abnormalities in Automotive Parts
MATLAB use in project:
• Preprocessing of captured images
• Image annotation for training
• Deep learning based analysis
• Various transfer learning methods (combinations of CNN models, classifiers)
• Estimation of defect area using Class Activation Map (CAM)
• Abnormality/defect classification
• Deployment to NVIDIA Jetson using GPU Coder
Automated visual inspection of 1.3 million bevel gears per month

Summary
Ground Truth Labeling
Network Design and Training
C/C++ and CUDA Code Generation
Hardware Targeting (CPUs and GPUs)
Key Takeaways
• Platform Productivity
• Framework Interoperability
Key Takeaways
• Optimized C/C++ and CUDA
• Hardware Targeting
• Processor-in-the-Loop (PIL) Testing

Thank You
