© 2019 MathWorks, Inc.
Deploying Deep Learning Models on Embedded Processors for Autonomous Systems with MATLAB
Bill Chou, Sandeep Hiremath
MathWorks
May 2019
Autonomous Systems
Control
Planning
Perception
Deep Learning for Perception in Autonomous Systems
Control
Planning
Perception
Path planning
Sensor models & model predictive control
Deep learning
Sensor fusion
Deep Learning in Automated Driving
Outline
• Ground Truth Labeling
• Network Design and Training
• C/C++ and CUDA Code Generation
• Hardware Targeting (CPUs and GPUs)
Key Takeaways
• Platform Productivity
• Framework Interoperability
Key Takeaways
• Optimized C/C++ and CUDA
• Hardware Targeting
• Processor-in-the-Loop (PIL) Testing
Perception in Autonomous Application
Example Used in Today’s Talk
Input → Lane Detection → Coordinate Transform → Output
Input → Object Detection → Bounding Box Processing → Output
Ground Truth Labeling App
Automate Labeling
• Lane Markers
• Vehicle Bounding Boxes
Perception in Autonomous Application
Deep Learning Models (Lane Detection, Object Detection)
Importing Pre-trained Models
>> net = alexnet
OR
Modify network layers
Import pre-trained networks (AlexNet, ResNet-50)
Re-train network with training data
Detector object
Interactive Network Design
Accelerated Training
Evaluate trained network
Training hardware options:
• Single CPU
• Single CPU, single GPU
• Single CPU, multiple GPUs
• Cloud GPUs
Network Evaluation
Lane and Object Detectors Running in MATLAB
[Diagram: Input, Logic, Logic, Output]
Multi-Platform Deep Learning Deployment
• Data Center
• Workstation
• NVIDIA Jetson
• NVIDIA DRIVE
• Raspberry Pi
Code generators: GPU Coder, MATLAB Coder
Target processors: NVIDIA GPUs, Intel CPUs, ARM Cortex-A CPUs
Generate Code from Non-Deep Learning Parts
Perception in Autonomous Application
Generate Optimized CUDA/C++ Code
2200+ Functions for C/C++, 380+ Functions for CUDA
• Communications Toolbox
• DSP System Toolbox
• Image Processing Toolbox
• Computer Vision Toolbox
• Signal Processing Toolbox
• Sensor Fusion and Tracking Toolbox
• Wavelet Toolbox
• WLAN Toolbox
• Phased Array System Toolbox
• Statistics and Machine Learning Toolbox
• Core Math
• Fixed-Point Designer
• Automated Driving Toolbox
• Robotics System Toolbox
• 5G Toolbox
Mapped to Optimization Libraries
Code generators: MATLAB Coder, GPU Coder
Target processors: NVIDIA GPUs, Intel CPUs, ARM Cortex-A CPUs
Libraries: cuBLAS, cuFFT, cuSolver, Thrust, cuDNN, TensorRT, MKL-DNN, FFTW, BLAS, ARM Compute Library, OpenCV
GPUs: Automatically Extract Parallelism from MATLAB
1. Scalarized MATLAB (“for-all” loops)
2. Vectorized MATLAB (math operators and library functions)
3. Composite functions in MATLAB (maps to cuBLAS, cuFFT, cuSolver, cuDNN, TensorRT)
Infer CUDA kernels from MATLAB loops
Library replacement
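To illustrate what makes a loop a candidate for kernel inference, the sketch below writes a SAXPY-style loop in which every iteration reads and writes only its own element. It is plain Python, not GPU Coder output, and the names `saxpy_loop` and `saxpy_any_order` are hypothetical. Because no iteration depends on another, the iterations can be visited in any order, which is the property that would let a compiler assign one GPU thread per iteration.

```python
import random


def saxpy_loop(a, x, y):
    """Scalarized form: one independent iteration per output element."""
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = a * x[i] + y[i]  # reads and writes index i only
    return out


def saxpy_any_order(a, x, y, seed=0):
    """Same loop body, with the iterations visited in a shuffled order.

    GPU threads run in no guaranteed order, so a loop is only a safe
    "for-all" loop if this version returns the same result.
    """
    order = list(range(len(x)))
    random.Random(seed).shuffle(order)
    out = [0.0] * len(x)
    for i in order:
        out[i] = a * x[i] + y[i]
    return out


x = [1.0, 2.0, 3.0, 4.0]
y = [10.0, 20.0, 30.0, 40.0]
print(saxpy_loop(2.0, x, y) == saxpy_any_order(2.0, x, y))  # prints True
```

A loop that accumulated into a shared variable, or read `out[i - 1]`, would fail this check and would need a different mapping (for example a reduction).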
GPU Coder Compiler Transforms & Optimizations
MATLAB
Front End
Control-Flow Graph Intermediate Representation
Traditional Compiler Optimizations
Loop Optimizations:
• Scalarization
• Loop Perfectization
• Loop Interchange
• Loop Fusion
• Scalar Replacement
CUDA Kernel Lowering:
• Library Function Mapping
• Parallel Loop Creation
• CUDA Kernel Creation
• cudaMemcpy Minimization
• Shared Memory Mapping
CUDA Code Emission
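Two of the loop optimizations named here, loop fusion and scalar replacement, can be sketched in a few lines. The example is an illustrative Python analogue with hypothetical function names, not a description of how GPU Coder implements these passes. The unfused version makes two passes over the data with a temporary array between them; the fused version makes one pass, and the temporary shrinks to a scalar that never leaves the loop body.

```python
def scale_then_offset_unfused(x, gain, bias):
    """Two separate loops with an intermediate array between them."""
    scaled = [0.0] * len(x)
    for i in range(len(x)):
        scaled[i] = gain * x[i]        # loop 1 writes a temporary array
    out = [0.0] * len(x)
    for i in range(len(x)):
        out[i] = scaled[i] + bias      # loop 2 reads it back
    return out


def scale_then_offset_fused(x, gain, bias):
    """One loop: both bodies merged, temporary array replaced by a scalar."""
    out = [0.0] * len(x)
    for i in range(len(x)):
        s = gain * x[i]                # scalar replacement of scaled[i]
        out[i] = s + bias
    return out


data = [0.5, 1.5, 2.5]
print(scale_then_offset_unfused(data, 2.0, 1.0) == scale_then_offset_fused(data, 2.0, 1.0))
```

On a GPU the fused form would correspond to a single kernel launch instead of two, with no round trip through memory for the intermediate values.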
Generate Code from Deep Learning Networks
Perception in Autonomous Application
Generate Optimized Inference Code
Deep Learning Network Optimizations
• Layer Fusion
• Memory Optimization
• Network Re-architecture
Deep Learning Network Optimizations
Original Network (layers: Conv, Batch Norm, ReLU, Add, Conv, ReLU, Max Pool, Max Pool)
Layer Fusion (Optimized Computation): Fused Conv, Fused Conv BatchNormAdd, Max Pool, Max Pool
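One common form of layer fusion is folding a batch-normalization layer into the convolution that precedes it, so that inference runs a single layer instead of two. The sketch below shows the arithmetic on a one-dimensional, single-channel convolution. It assumes inference-time batch norm with fixed statistics; the function names are hypothetical, and the slides do not state that GPU Coder performs the fusion in exactly this way.

```python
import math


def conv1d(x, kernel, bias):
    """Valid sliding-window convolution (cross-correlation form)."""
    k = len(kernel)
    return [sum(kernel[j] * x[i + j] for j in range(k)) + bias
            for i in range(len(x) - k + 1)]


def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch norm with fixed mean and variance."""
    scale = gamma / math.sqrt(var + eps)
    return [(v - mean) * scale + beta for v in y]


def fold_batch_norm(kernel, bias, gamma, beta, mean, var, eps=1e-5):
    """Return (kernel, bias) of one conv equal to conv followed by batch norm."""
    scale = gamma / math.sqrt(var + eps)
    return [w * scale for w in kernel], (bias - mean) * scale + beta


x = [1.0, 2.0, 0.5, -1.0, 3.0]
kernel, bias = [0.2, -0.4, 0.6], 0.1
bn = dict(gamma=1.5, beta=-0.3, mean=0.25, var=0.8)

two_layers = batch_norm(conv1d(x, kernel, bias), **bn)
fused_kernel, fused_bias = fold_batch_norm(kernel, bias, **bn)
one_layer = conv1d(x, fused_kernel, fused_bias)

print(all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(two_layers, one_layer)))
```

The fused weights are computed once, ahead of time, so the per-image cost of the batch-norm layer disappears from the inference path.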
Buffer Minimization (Optimized Memory): Fused Conv, Fused Conv BatchNormAdd, Max Pool, Max Pool
Buffers A through E; two buffers are eliminated (X) by reusing Buffer A and Buffer B
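The idea behind buffer minimization can be sketched as a liveness-based allocator: a layer's output buffer is returned to a free pool once its last reader has run, and later layers take from that pool before allocating anything new. The five-layer network below, with one skip connection, is a hypothetical topology chosen for illustration (the slide's exact graph is not recoverable from the text), and `assign_buffers` is not a GPU Coder API.

```python
import heapq


def assign_buffers(last_use):
    """Assign a buffer id to each layer's output.

    last_use[i] is the index of the last layer that reads layer i's output.
    Use len(last_use) for the final output so it is never released.
    """
    free = []        # min-heap of buffer ids available for reuse
    active = {}      # producer layer -> buffer id still holding live data
    assignment = []
    next_id = 0
    for layer in range(len(last_use)):
        # Release buffers whose last reader ran before this layer.
        for producer in [p for p in active if last_use[p] < layer]:
            heapq.heappush(free, active.pop(producer))
        if free:
            buf = heapq.heappop(free)
        else:
            buf = next_id
            next_id += 1
        active[layer] = buf
        assignment.append(buf)
    return assignment


# Layer 0 feeds layers 1 and 3 (skip connection); layer 4 is the output.
last_use = [3, 2, 3, 4, 5]
names = ["ABCDE"[b] for b in assign_buffers(last_use)]
print(names)  # prints ['A', 'B', 'C', 'B', 'A']
```

In this hypothetical case, five layer outputs fit in three buffers, with the last two layers reusing B and A.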
Supported Pretrained Networks
SegNet
ResNet-50
VGG-19
Inception-v3
SqueezeNet
VGG-16
AlexNet
GoogLeNet
ResNet-101
Optimized Deep Learning Libraries & Runtimes
GPU Coder: cuDNN, TensorRT (NVIDIA GPUs)
MATLAB Coder: MKL-DNN (Intel CPUs), ARM Compute Library (ARM Cortex-A CPUs)
• Semantic Segmentation
• Defective Product Detection
• Blood Smear Segmentation
Generating CUDA Code and Running It on a Titan V GPU
How Is the Performance?
Single Image Inference on Titan V using cuDNN
Compared: PyTorch (1.0.0), MXNet (1.4.0), GPU Coder (R2019a), TensorFlow (1.13.0)
Test setup: Intel® Xeon® CPU 3.6 GHz; NVIDIA libraries: CUDA 10, cuDNN 7; frameworks: TensorFlow 1.13.0, MXNet 1.4.0, PyTorch 1.0.0
TensorRT Accelerates Inference on Titan V
Single Image Inference with ResNet-50 (Titan V)
Configurations: cuDNN, TensorRT (FP32), TensorRT (INT8)
Compared: GPU Coder, TensorFlow
Single Image Inference on CPU
CPU, Single Image Inference (Linux)
Compared: MATLAB, TensorFlow, MXNet, MATLAB Coder, PyTorch
Test setup: Intel® Xeon® CPU 3.6 GHz; frameworks: TensorFlow 1.6.0, MXNet 1.2.1, PyTorch 0.3.1
Access Target Peripherals from MATLAB
Host Machine
Peripheral Data
Target boards: Jetson AGX Xavier, DRIVE AGX, Raspberry Pi
Deploy Application to Target Boards
Host Machine
• Generated CUDA Code: Jetson AGX Xavier, DRIVE AGX
• Generated C/C++ Code: Raspberry Pi
Deploy Application to Jetson AGX Xavier
Host Machine: deploy generated CUDA code
Jetson AGX Xavier: video feed, target display
Processor-in-the-Loop (PIL) Testing on Hardware Boards
Host Machine: deploy generated CUDA code; send inputs & compare results
Jetson AGX Xavier: target board
Data exchange between host machine and target
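The "send inputs and compare results" step can be sketched as a small comparison harness. Everything here is a stand-in under stated assumptions: `reference_model` plays the role of the algorithm running on the host, and `run_on_target` simulates the board by rounding results to single precision rather than actually communicating with hardware. Neither name is a MathWorks API.

```python
import math
import struct


def reference_model(x):
    """Stand-in for the original algorithm on the host (hypothetical)."""
    return [math.tanh(v) for v in x]


def run_on_target(x):
    """Stand-in for running the deployed code on the board (hypothetical).

    A real PIL setup would send `x` over the host-target link and read the
    result back. Here the target is simulated by rounding to float32.
    """
    return [struct.unpack("f", struct.pack("f", math.tanh(v)))[0] for v in x]


def pil_compare(test_inputs, tol=1e-6):
    """Return (passed, worst_error) over all test vectors."""
    worst = 0.0
    for x in test_inputs:
        expected = reference_model(x)
        actual = run_on_target(x)
        worst = max(worst, max(abs(e - a) for e, a in zip(expected, actual)))
    return worst <= tol, worst


passed, worst = pil_compare([[0.1, 0.5, -0.3], [1.0, -2.0, 0.0]])
print(passed)
```

The tolerance matters because generated code on the target may use different precision or library routines than the host, so bit-exact agreement is usually not the right pass criterion.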
Musashi Seimitsu Industry Co., Ltd.
Detect Abnormalities in Automotive Parts
Automated visual inspection of 1.3 million bevel gears per month
MATLAB use in project:
• Preprocessing of captured images
• Image annotation for training
• Deep learning based analysis
• Various transfer learning methods (combinations of CNN models, classifiers)
• Estimation of defect area using Class Activation Map (CAM)
• Abnormality/defect classification
• Deployment to NVIDIA Jetson using GPU Coder
Summary
• Ground Truth Labeling
• Network Design and Training
• C/C++ and CUDA Code Generation
• Hardware Targeting (CPUs and GPUs)
Key Takeaways
• Platform Productivity
• Framework Interoperability
Key Takeaways
• Optimized C/C++ and CUDA
• Hardware Targeting
• Processor-in-the-Loop (PIL) Testing
Thank You

"Deploying Deep Learning Models on Embedded Processors for Autonomous Systems with MATLAB," a Presentation from MathWorks
