6.S094: Deep Learning for Self-Driving Cars
Learning to Drive: Convolutional Neural Networks
and End-to-End Learning of the Full Driving Task
Lex Fridman (fridman@mit.edu)
Website: cars.mit.edu
January 2017
Administrative
• Website: cars.mit.edu
• Contact Email: deepcars@mit.edu
• Required:
• Create an account on the website.
• Follow the tutorial for each of the 2 projects.
• Recommended:
• Ask questions
• Win competition!
• Office hours: Friday, 5-7pm
(more info coming soon)
Schedule
DeepTraffic Leaderboard
Illustrative Case Study: Traffic Light Detection
DeepTesla: End-to-End Learning from Human and Autopilot Driving
(in ConvnetJS)
DeepTesla: End-to-End Learning from Human and Autopilot Driving
(in TensorFlow)
Computer Vision is Machine Learning
References: [81]
Types of machine learning: Supervised Learning, Unsupervised Learning,
Semi-Supervised Learning, Reinforcement Learning
(Figure: standard supervised learning pipeline)
Images are Numbers
References: [89]
• Regression: The output variable takes continuous values
• Classification: The output variable takes class labels
• Underneath it may still produce continuous values such as
probability of belonging to a particular class.
Computer Vision is Hard
References: [66, 69, 89]
Image Classification Pipeline
References: [81, 89]
Famous Computer Vision Datasets
References: [90, 91, 92, 93]
• MNIST: handwritten digits
• ImageNet: WordNet hierarchy
• CIFAR-10(0): tiny images
• Places: natural scenes
Let’s Build an Image Classifier for CIFAR-10
References: [89, 91]
Let’s Build an Image Classifier for CIFAR-10
References: [89, 91]
Accuracy
Random: 10%
Our image-diff (with L1): 38.6%
Our image-diff (with L2): 35.4%
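For concreteness, a minimal sketch of the image-diff classifier in Python (ours, not the slide's code): label a test image with the label of the training image at the smallest L1 pixel distance.

import numpy as np

def predict_image_diff(X_train, y_train, x_test):
    # X_train: (N, 3072) flattened CIFAR-10 training images; y_train: (N,) labels
    distances = np.sum(np.abs(X_train - x_test), axis=1)  # L1 distance to every training image
    return y_train[np.argmin(distances)]                  # label of the closest one
    # for L2, use instead: np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))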
K-Nearest Neighbors: Generalizing the Image-Diff Classifier
References: [89]
Tuning (hyper)parameters:
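A hedged sketch of the k-NN generalization, where k is the hyperparameter tuned on a held-out validation split (never on the test set):

import numpy as np

def predict_knn(X_train, y_train, x_test, k=7):
    distances = np.sum(np.abs(X_train - x_test), axis=1)  # L1 distance to all training images
    nearest = y_train[np.argsort(distances)[:k]]          # labels of the k closest images
    return np.bincount(nearest).argmax()                  # majority vote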
K-Nearest Neighbors: Generalizing the Image-Diff Classifier
References: [89, 94]
Accuracy
Random: 10%
Training and testing on the same data: 35.4%
7-Nearest Neighbors: ~30%
Human: ~94%
…
Convolutional Neural Networks: ~95%
Reminder: Weighing the Evidence
References: [78]
(Figure: evidence → decisions)
Reminder: Classify an Image of a Number
References: [80]
Input:
(28x28)
Network:
Reminder: “Learning” is Optimization of a Function
References: [63, 80]
Ground truth for “6”:
“Loss” function:
Convolutional Neural Networks
References: [95]
Regular neural network (fully connected):
Convolutional neural network:
Each layer takes a 3D volume and produces a 3D volume through some
smooth function that may or may not have parameters.
Convolutional Neural Networks: Layers
• INPUT [32x32x3] will hold the raw pixel values of the image, in this case an image of width 32, height 32, and
with three color channels R,G,B.
• CONV layer will compute the output of neurons that are connected to local regions in the input, each computing
a dot product between their weights and a small region they are connected to in the input volume. This may
result in a volume such as [32x32x12] if we decided to use 12 filters.
• RELU layer will apply an elementwise activation function, such as the max(0,x) thresholding at zero. This leaves
the size of the volume unchanged ([32x32x12]).
• POOL layer will perform a downsampling operation along the spatial dimensions (width, height), resulting in a
volume such as [16x16x12].
• FC (i.e. fully-connected) layer will compute the class scores, resulting in a volume of size [1x1x10], where each of
the 10 numbers corresponds to a class score, such as among the 10 categories of CIFAR-10. As with ordinary
Neural Networks, and as the name implies, each neuron in this layer will be connected to all the numbers in the
previous volume.
References: [95]
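To make the stack concrete, here is a sketch (illustrative hyperparameters, not the course's model) of the same INPUT → CONV → RELU → POOL → FC sequence, written in the TensorFlow style used later in this lecture:

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 32, 32, 3])                    # INPUT [32x32x3]
W = tf.Variable(tf.truncated_normal([5, 5, 3, 12], stddev=0.1))      # 12 filters of size 5x5
b = tf.Variable(tf.constant(0.1, shape=[12]))
conv = tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME') + b  # CONV -> [32x32x12]
relu = tf.nn.relu(conv)                                              # RELU, size unchanged
pool = tf.nn.max_pool(relu, ksize=[1, 2, 2, 1],
                      strides=[1, 2, 2, 1], padding='SAME')          # POOL -> [16x16x12]
flat = tf.reshape(pool, [-1, 16 * 16 * 12])
W_fc = tf.Variable(tf.truncated_normal([16 * 16 * 12, 10], stddev=0.1))
b_fc = tf.Variable(tf.constant(0.1, shape=[10]))
scores = tf.matmul(flat, W_fc) + b_fc                                # FC -> 10 class scores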
Dealing with Images: Local Connectivity
Same neuron. Just more focused (narrow “receptive field”).
The parameters on each filter are spatially “shared”
(if a feature is useful in one place, it’s useful elsewhere)
References: [95]
ConvNets: Spatial Arrangement of Output Volume
• Depth: number of filters
• Stride: filter step size (when we “slide” it)
• Padding: zero-pad the input
References: [95]
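These three settings determine the output size; a small sketch of the standard arithmetic (W = input width, F = filter size, P = padding, S = stride):

def conv_output_size(W, F, P, S):
    # standard formula: (W - F + 2P) / S + 1, per spatial dimension
    assert (W - F + 2 * P) % S == 0, "filter does not tile the input evenly"
    return (W - F + 2 * P) // S + 1

print(conv_output_size(W=32, F=5, P=2, S=1))  # 32: padding preserves width
print(conv_output_size(W=32, F=2, P=0, S=2))  # 16: stride 2 halves width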
ConvNets: Pooling
References: [95]
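For concreteness, a minimal numeric sketch of 2x2 max pooling with stride 2, which halves each spatial dimension:

import numpy as np

a = np.array([[1, 3, 2, 1],
              [4, 6, 5, 7],
              [8, 2, 0, 1],
              [3, 4, 2, 9]], dtype=float)

pooled = a.reshape(2, 2, 2, 2).max(axis=(1, 3))  # max over each non-overlapping 2x2 block
print(pooled)  # [[6. 7.]
               #  [8. 9.]]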
Computer Vision:
Object Recognition / Classification
References: [4]
Computer Vision:
Segmentation
(Figure panels: Original, Ground Truth, FCN-8)
References: [96]
Computer Vision:
Object Detection
References: [97]
How Can Convolutional Neural Networks Help Us Drive?
Driving: The Numbers
(in United States, in 2014)
Miles:
• All drivers: 10,658 miles
(29.2 miles per day)
• Rural drivers: 12,264 miles
• Urban drivers: 9,709 miles
Fatalities:
• Fatal crashes: 29,989
• All fatalities: 32,675
• Car occupants: 12,507
• SUV occupants: 8,320
• Pedestrians: 4,884
• Motorcycle: 4,295
• Bicyclists: 720
• Large trucks: 587
Cars We Drive
Human at the Center of Automation:
The Way to Full Autonomy Includes the Human
(Spectrum from Fully Human Controlled to Fully Machine Controlled:
Ford F150 → Tesla Model S → Google Self-Driving Car)
Human at the Center of Automation:
The Way to Full Autonomy Includes the Human
• Emergency
• Automatic emergency braking (AEB)
• Warnings
• Lane departure warning (LDW)
• Forward collision warning (FCW)
• Blind spot detection
• Longitudinal
• Adaptive cruise control (ACC)
• Lateral
• Lane keep assist (LKA)
• Automatic steering
• Control and Planning
• Automatic lane change
• Automatic parking
Tesla Autopilot
Distracted Humans
• Injuries and fatalities:
3,179 people were killed and 431,000 were
injured in motor vehicle crashes involving
distracted drivers
(in 2014)
• Texts:
169.3 billion text messages were sent in the
US every month.
(as of December 2014)
• Eyes off road:
5 seconds is the average time your eyes are
off the road while texting. When traveling
at 55mph, that's enough time to cover the
length of a football field blindfolded.
What is distracted driving?
• Texting
• Using a smartphone
• Eating and drinking
• Talking to passengers
• Grooming
• Reading, including maps
• Using a navigation system
• Watching a video
• Adjusting a radio
4 D’s of Being Human:
Drunk, Drugged, Distracted, Drowsy
• Drunk Driving: In 2014, 31 percent of traffic fatalities
involved a drunk driver.
• Drugged Driving: 23% of night-time drivers tested positive
for illegal, prescription or over-the-counter medications.
• Distracted Driving: In 2014, 3,179 people (10 percent of
overall traffic fatalities) were killed in crashes involving
distracted drivers.
• Drowsy Driving: In 2014, nearly three percent of all traffic
fatalities involved a drowsy driver, and at least 846 people
were killed in crashes involving a drowsy driver.
In Context: Traffic Fatalities
Total miles driven in U.S. in 2014:
3,000,000,000,000 (3 trillion)
Fatalities: 32,675
(about 1 per 90 million miles)
Tesla Autopilot miles driven since October 2015:
300,000,000 (300 million)
(as of December 2016)
Fatalities: 1
We (increasingly) understand the first number (human driving).
We do not yet understand the second number (Autopilot driving).
We need A LOT of real-world semi-autonomous driving data!
Computer Vision + Machine Learning + Big Data = Understanding
The Data
Teslas instrumented: 17
Hours of data: 5,000+
Distance traveled: 70,000+ miles
Camera and Lens Selection
Fisheye: Capture full range of head, body
movement inside vehicle.
2.8-12mm Focal Length: “Zoom” on the face
without obstructing the driver’s view.
Logitech C920:
On-board H264 Compression
Case for C-Mount Lens:
Flexibility in lens selection
Semi-Autonomous Vehicle Components
External
1. Radar
2. Visible-light camera
3. LIDAR
4. Infrared camera
5. Stereo vision
6. GPS/IMU
7. CAN
8. Audio
Internal
1. Visible-light camera
2. Infrared camera
3. Audio
Self-Driving Car Tasks
• Localization and Mapping:
Where am I?
• Scene Understanding:
Where is everyone else?
• Movement Planning:
How do I get from A to B?
• Driver State:
What’s the driver up to?
Visual Odometry
• 6-DOF: six degrees of freedom of movement
• Changes in position:
• Forward/backward: surge
• Left/right: sway
• Up/down: heave
• Orientation:
• Pitch, Yaw, Roll
• Source:
• Monocular: I moved 1 unit
• Stereo: I moved 1 meter
• Mono = Stereo for far away objects
• PS: For tiny robots everything is “far away” relative to inter-camera
distance
SLAM: Simultaneous Localization and Mapping
What works: SIFT and optical flow
References: [98, 99]
Visual Odometry in Parts
• (Stereo) Undistortion, Rectification
• (Stereo) Disparity Map Computation
• Feature Detection (e.g., SIFT, FAST)
• Feature Tracking (e.g., KLT: Kanade-Lucas-Tomasi)
• Trajectory Estimation
• Use rigid parts of the scene (requires outlier/inlier detection)
• For mono, need more info* like camera orientation and height off
the ground
* Kitt, Bernd Manfred, et al. "Monocular visual odometry using a planar road model to solve scale ambiguity." (2011).
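As an illustrative sketch only (not the course pipeline; file names are hypothetical), the feature detection and KLT tracking steps with OpenCV:

import cv2

prev_gray = cv2.imread('frame0.png', 0)  # two consecutive grayscale frames (hypothetical files)
next_gray = cv2.imread('frame1.png', 0)

# feature detection (Shi-Tomasi corners here; FAST or SIFT are alternatives)
pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500, qualityLevel=0.01, minDistance=7)

# feature tracking with pyramidal Lucas-Kanade (KLT)
next_pts, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)

tracked = next_pts[status.ravel() == 1]  # keep successful tracks for trajectory estimation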
End-to-End Visual Odometry
Konda, Kishore, and Roland Memisevic. "Learning visual odometry with a convolutional
network." International Conference on Computer Vision Theory and Applications. 2015.
Object Detection
• Past approaches: cascade classifiers (Haar-like features)
• Where deep learning can help:
recognition, classification, detection
Full Driving Scene Segmentation
Fully Convolutional Network implementation:
https://github.com/tkuanlun350/Tensorflow-SegNet
Road Texture and Condition from Audio
(with Recurrent Neural Networks)
Movement Planning
• Previous approaches: optimization-based control
• Where deep learning can help: reinforcement learning
Deep Reinforcement Learning implementation:
https://github.com/nivwusquorum/tensorflow-deepq
Self-Driving Car Tasks
• Localization:
Where am I?
• Object detection:
Where is everyone else?
• Movement planning:
How do I get from A to B?
• Driver state:
What’s the driver up to?
Driver State Detection:
A Multi-Resolutional View
Increasing level of detection resolution and difficulty:
Gaze Classification, Blink Rate, Blink Duration, Head Pose, Eye Pose,
Pupil Diameter, Micro Saccades, Body Pose, Blink Dynamics,
Micro Glances, Cognitive Load, Drowsiness
Gaze Region and Autopilot State
Driver Emotion
If Driving is a Conversation, this is End-to-End
Natural Language Generation
Turing Test:
Can a computer be mistaken for a
human more than 30% of the time?
1. Natural language processing to enable
it to communicate successfully
2. Knowledge representation to store
information provided before or during
the interrogation
3. Automated reasoning to use the stored
information to answer questions and to
draw new conclusions
Autonomous Driving: End-to-End
(Figure: “Magic Happens”)
Stairway to Automation
(Figure: Ford F150 → Tesla Model S → Google Self-Driving Car;
training dataset vs. testing dataset)
Autonomous Driving: End-to-End
• 9 layers
• 1 normalization layer
• 5 convolutional layers
• 3 fully connected layers
• 27 million connections
• 250 thousand parameters
End-to-End Driving with ConvnetJS
Tutorial on http://cars.mit.edu/deeptesla
End-to-End Steering
• By the end of this lecture, you’ll be able to train a model
that can steer a vehicle
• The input to our network will be a single image of the
forward roadway from a Tesla
• The output will be a steering wheel value between -20 and 20
Creating the Dataset
• We recorded and extracted 10 video clips of highway driving
from a Tesla
• The wheel value was extracted from the in-vehicle CAN
• We cropped/extracted a window from each video frame and
provided a CSV linking each window to a wheel value
Lighting and Road Conditions
ConvNetJS Overview
• ConvNetJS is a Javascript
implementation for using
and training neural
networks within the
browser
• It supports simple networks
with several different layer
types and training
algorithms
• Constructing and training a
network can be performed
in very few lines of code,
great for demonstrations
ConvNetJS – Neural Network Representation
• The network is represented
by a single Javascript object
which contains a list of
layers
• Each layer contains a plain
array of weights (w), the
activation/activation
gradients of the last
forward pass, as well as the
shape and layer type
Layer Types
• ConvNetJS implements several different layer types:
convolutional, pooling, fully-connected, local contrast
normalization, and loss layers
• There are three available output types: regression, softmax,
and SVM
ConvNetJS – Training Overview
• To train a network, you first must initialize a “Trainer” object:
var trainer = new convnetjs.SGDTrainer(net, {method: 'adadelta', batch_size: 1, l2_decay: 0.0001});
• There are three training algorithms available: SGD, Adadelta, and Adagrad
• Training is performed by manually calling trainer.train(input_volume, expected_output),
which returns an object containing timing and loss function information
DeepTesla Overview
Model Metrics
Network Designer
Training Interaction
Layer Visualization
Input Layer
Convolutional Layer Visualization
Video Visualization
Information Bar
Input Box
Barcodes
• 17-bit, sign-magnitude
• Encoded into the actual video
• 0 = black, 1 = white
• Frame on top, wheel on
bottom (divided by two)
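The decoding logic, sketched here in Python for clarity (the site does it in Javascript); treating the first bit as the sign is our reading of “sign-magnitude”:

def decode_barcode(bits):
    # bits: 17 values in {0, 1}, left to right (0 = black, 1 = white)
    sign = -1 if bits[0] else 1    # assumed: first bit is the sign
    magnitude = 0
    for b in bits[1:]:             # remaining 16 bits encode the magnitude
        magnitude = (magnitude << 1) | b
    return sign * magnitude

print(decode_barcode([0] * 15 + [1, 0]))  # 2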
Image Batches
• Each image loaded over the
network contains an entire
batch
• There is one image per row,
and 250 rows in total
• These images are
reassembled into volumes
upon download
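Sketched in Python for clarity (the site does this in Javascript; the row-band layout is our reading of the bullets above):

import numpy as np

def split_batch_image(batch_img, rows=250):
    # batch_img: (rows * h, w, 3) array; each horizontal band of height h is one example
    h = batch_img.shape[0] // rows
    return [batch_img[i * h:(i + 1) * h] for i in range(rows)]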
Training Explanation
• One web worker is used for loading examples
• Each batch of training images is one large image with each row as a
single training example
• After an image finishes loading asynchronously, it sends the training
examples to another worker
• One web worker is used for training the network
• It trains on each image and pushes the network/outputs to the
visualization worker
• One web worker is used for visualization
• For a specified training example interval, it blits the activation/gradient
output of each training example onto a canvas
• Each web worker behaves as a single thread, and we use
message passing to communicate state between the
workers
ConvNetJS Evaluation - Video Explanation
• The videos are encoded at 1280x820 in H264/MKV with 17-bit
sign-magnitude barcodes
• The main video frame is stored in the box (0, 1280, 0, 720)
• The frame barcode is in box (1144, 720, 1280, 770)
• The wheel value barcode is in box (1144, 770, 1280, 820)
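A sketch of cropping those regions, assuming (x1, y1, x2, y2) ordering for the barcode boxes and a 1280x820 frame:

import cv2

frame = cv2.imread('video_frame.png')      # one decoded 1280x820 frame (hypothetical file)
main = frame[0:720, 0:1280]                # main video frame
frame_barcode = frame[720:770, 1144:1280]  # frame-number barcode
wheel_barcode = frame[770:820, 1144:1280]  # wheel-value barcode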
ConvNetJS Evaluation - Creating the Video
• Each epoch is synchronized to 30 fps
• We extract the wheel value from the CAN data and
synchronize each message to a frame (both the frame and
the CAN message are timestamped)
• Using OpenCV, we process the data:
• Generate a barcode for the frame containing the wheel data
• Crop the image portion used for training
• Create single images containing batches of training data
• The epochs and associated data are copied to our web
server, which serves them to the browser
ConvNetJS Evaluation - Playing the Video
• To be able to use the video in the neural network, we need
to do some preprocessing
• First, we have a hidden video element and rely on modern
HTML5 video implementations
• When the user requests the video to play, we begin tracking
each redraw of the page
• With each redraw, we grab the currently rendered video
frame, extract the RGBA values and blit them to two
different canvases: one canvas, which the user sees, and
another canvas which is hidden and only contains a cropped
portion of the frame (the part we will use for the neural
network)
ConvNetJS Evaluation - Playing the Video
• Next, we read the image data from the hidden canvas and
shape it into a ConvNetJS volume
• For each image we first create a volume:
var image_vol = new convnetjs.Vol(x_size, y_size, depth, default_value);
• Next, we extract each pixel from the canvas and set the
equivalent voxel (volume pixel) to the value (skipping the
alpha value)
• We can also extract the expected steering value by parsing the
barcode (a 17-bit, sign-magnitude barcode, where white = 1,
black = 0)
ConvNetJS Evaluation - Forward Pass
• Now we can use our extracted volume in the forward pass
by calling net.forward(our_volume)
• The predicted value is stored in the output neuron:
var prediction = net.forward(vol);
var raw_regression_value = prediction.w[0];
• Because we min-max normalized our inputs while training
the network, we need to transform our outputs – this is just
the reverse of the transformation we performed on the input:
wheel_value = (raw_regression_value * total_wheel_range) + wheel_min
• We visualize the predicted and actual steering wheel values
and calculate the error
End-to-End Driving with TensorFlow
Available on http://github.com/lexfridman/deeptesla
Build the Model: Input and Output
def weight_variable(shape):
initial = tf.truncated_normal(shape, stddev=0.1)
return tf.Variable(initial)
def bias_variable(shape):
initial = tf.constant(0.1, shape=shape)
return tf.Variable(initial)
def conv2d(x, W, stride):
return tf.nn.conv2d(x, W, strides=[1, stride, stride, 1],
padding='VALID')
x = tf.placeholder(tf.float32, shape=[None, 66, 200, 3])
y_ = tf.placeholder(tf.float32, shape=[None, 1])
x_image = x
Build the Model: Convolutional Layers
#first convolutional layer
W_conv1 = weight_variable([5, 5, 3, 24])
b_conv1 = bias_variable([24])
h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1, 2) + b_conv1)
#second convolutional layer
W_conv2 = weight_variable([5, 5, 24, 36])
b_conv2 = bias_variable([36])
h_conv2 = tf.nn.relu(conv2d(h_conv1, W_conv2, 2) + b_conv2)
#third convolutional layer
W_conv3 = weight_variable([5, 5, 36, 48])
b_conv3 = bias_variable([48])
h_conv3 = tf.nn.relu(conv2d(h_conv2, W_conv3, 2) + b_conv3)
#fourth convolutional layer
W_conv4 = weight_variable([3, 3, 48, 64])
b_conv4 = bias_variable([64])
h_conv4 = tf.nn.relu(conv2d(h_conv3, W_conv4, 1) + b_conv4)
#fifth convolutional layer
W_conv5 = weight_variable([3, 3, 64, 64])
b_conv5 = bias_variable([64])
h_conv5 = tf.nn.relu(conv2d(h_conv4, W_conv5, 1) + b_conv5)
Build the Model: Fully Connected Layers
# fully connected layer 1
W_fc1 = weight_variable([1152, 1164])
b_fc1 = bias_variable([1164])
h_conv5_flat = tf.reshape(h_conv5, [-1, 1152])
h_fc1 = tf.nn.relu(tf.matmul(h_conv5_flat, W_fc1) + b_fc1)
keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = tf.nn.dropout(h_fc1, keep_prob)
# fully connected layer 2
W_fc2 = weight_variable([1164, 100])
b_fc2 = bias_variable([100])
h_fc2 = tf.nn.relu(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)
h_fc2_drop = tf.nn.dropout(h_fc2, keep_prob)
# fully connected layer 3
W_fc3 = weight_variable([100, 50])
b_fc3 = bias_variable([50])
h_fc3 = tf.nn.relu(tf.matmul(h_fc2_drop, W_fc3) + b_fc3)
h_fc3_drop = tf.nn.dropout(h_fc3, keep_prob)
# fully connected layer 4
W_fc4 = weight_variable([50, 10])
b_fc4 = bias_variable([10])
h_fc4 = tf.nn.relu(tf.matmul(h_fc3_drop, W_fc4) + b_fc4)
h_fc4_drop = tf.nn.dropout(h_fc4, keep_prob)
#Output
W_fc5 = weight_variable([10, 1])
b_fc5 = bias_variable([1])
y = tf.mul(tf.atan(tf.matmul(h_fc4_drop, W_fc5) + b_fc5), 2)
Train the Model
import os
import tensorflow as tf
import driving_data  # the repo's data loader
import model         # the repo's model definition

LOGDIR = './save'  # checkpoint directory (assumed)

sess = tf.InteractiveSession()
loss = tf.reduce_mean(tf.square(tf.sub(model.y_, model.y)))
train_step = tf.train.AdamOptimizer(1e-4).minimize(loss)
sess.run(tf.initialize_all_variables())
saver = tf.train.Saver()

for i in range(int(driving_data.num_images * 0.3)):
    xs, ys = driving_data.LoadTrainBatch(100)
    train_step.run(feed_dict={model.x: xs, model.y_: ys, model.keep_prob: 0.8})
    if i % 10 == 0:
        xs, ys = driving_data.LoadValBatch(100)
        print("step %d, val loss %g" % (i, loss.eval(feed_dict={
            model.x: xs, model.y_: ys, model.keep_prob: 1.0})))
    if i % 100 == 0:
        if not os.path.exists(LOGDIR):
            os.makedirs(LOGDIR)
        checkpoint_path = os.path.join(LOGDIR, "model.ckpt")
        filename = saver.save(sess, checkpoint_path)
        print("Model saved in file: %s" % filename)
Run the Model
import tensorflow as tf
import scipy.misc
import model
import cv2

sess = tf.InteractiveSession()
saver = tf.train.Saver()
saver.restore(sess, "save/model.ckpt")

img = cv2.imread('steering_wheel_image.jpg', 0)
rows, cols = img.shape
smoothed_angle = 0

cap = cv2.VideoCapture(0)
while cv2.waitKey(10) != ord('q'):
    ret, frame = cap.read()
    image = scipy.misc.imresize(frame, [66, 200]) / 255.0
    degrees = model.y.eval(feed_dict={model.x: [image], model.keep_prob: 1.0})[0][0] * 180 / scipy.pi
    cv2.imshow('frame', frame)
    # smooth the displayed angle so the rendered wheel turns gradually
    if degrees != smoothed_angle:  # guard against division by zero
        smoothed_angle += 0.2 * pow(abs(degrees - smoothed_angle), 2.0 / 3.0) * \
            (degrees - smoothed_angle) / abs(degrees - smoothed_angle)
    M = cv2.getRotationMatrix2D((cols / 2, rows / 2), -smoothed_angle, 1)
    dst = cv2.warpAffine(img, M, (cols, rows))
    cv2.imshow("steering wheel", dst)

cap.release()
cv2.destroyAllWindows()
Traffic Light Classification with TensorFlow
We will implement a simple traffic light classifier with 3 classes (red, green, yellow)
Parameters
• Max epochs: the number of times the neural network will see all training examples
• Input_img_x/y: the size we will use for inputs into the network
• Batch size: # of examples the neural network will see before making a gradient step
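A sketch of how these might be declared (names follow the bullets; the values are illustrative, not the tutorial's):

max_epochs = 100    # passes over the full training set
input_img_x = 32    # input width fed to the network
input_img_y = 32    # input height
batch_size = 50     # examples seen before each gradient step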
Helper Functions
We use some helper functions to make adding layers
easier/more consistent
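A sketch consistent with the helper functions used in the DeepTesla TensorFlow code above:

import tensorflow as tf

def weight_variable(shape):
    # small random initialization breaks symmetry between filters
    return tf.Variable(tf.truncated_normal(shape, stddev=0.1))

def bias_variable(shape):
    return tf.Variable(tf.constant(0.1, shape=shape))

def conv2d(x, W):
    return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')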
Model Input/Output
• We specify our input and output types in the same lines to
make sure they agree with our idea of the network
• Our input is an image of size 32x32x3 (RGB channels)
• Our output consists of 3 neurons, representing the
probability of each class
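A sketch of the placeholders this describes:

x = tf.placeholder(tf.float32, shape=[None, 32, 32, 3])  # 32x32 RGB input
y_ = tf.placeholder(tf.float32, shape=[None, 3])         # one-hot: red, green, yellow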
Convolutional Layer
• Here we specify our first convolutional layer using our helper
function
• W_conv1 – a 4D tensor representing the weights [filter_x,
filter_y, previous layer neurons, # of filters]
• b_conv1 – our simple addition variable
• h_conv1 – our actual layer/activation
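A sketch using the helpers above (the filter count of 32 is illustrative):

W_conv1 = weight_variable([5, 5, 3, 32])  # [filter_x, filter_y, previous layer neurons, # of filters]
b_conv1 = bias_variable([32])             # one bias per filter
h_conv1 = tf.nn.relu(conv2d(x, W_conv1) + b_conv1)  # the actual layer/activation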
Pooling Layer
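A sketch using the pooling helper defined above:

h_pool1 = max_pool_2x2(h_conv1)  # halves the spatial dimensions: 32x32 -> 16x16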
Flattening Pool Layer
We calculate the total number of neurons needed in our first fully-connected
layer by multiplying all the dimensions of the pool layer shape
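A sketch of that calculation, continuing the layer sizes assumed above:

flat_size = 16 * 16 * 32                             # all pool-layer dimensions multiplied
h_pool1_flat = tf.reshape(h_pool1, [-1, flat_size])  # keep the batch dimension, flatten the rest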
Output Layer
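A sketch, assuming a preceding fully-connected layer of 128 units (illustrative, not the tutorial's sizes):

h_fc1 = tf.nn.relu(tf.matmul(h_pool1_flat, weight_variable([flat_size, 128])) + bias_variable([128]))
W_out = weight_variable([128, 3])
b_out = bias_variable([3])
y = tf.matmul(h_fc1, W_out) + b_out  # unnormalized scores (logits) for red/green/yellow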
Loss and Optimizer
• Our loss function performs softmax and then computes
cross-entropy
• We use the AdamOptimizer and specify a learning rate
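A sketch with the period-appropriate (TF 0.x) ops, matching the DeepTesla code above:

cross_entropy = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(y, y_))  # softmax + cross-entropy in one op (logits, labels)
train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)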
Saver Object
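A minimal sketch:

saver = tf.train.Saver()  # checkpoints all trainable variables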
Loading Images
• Iterate over each image, resize to 32x32
• Create a one hot encoding of our class
• Shuffle the entire dataset
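A hedged sketch of that loading loop (paths_and_labels and the class list are hypothetical names):

import random
import cv2
import numpy as np

classes = ['red', 'green', 'yellow']

def load_dataset(paths_and_labels):
    data = []
    for path, label in paths_and_labels:
        img = cv2.resize(cv2.imread(path), (32, 32))  # resize to 32x32
        one_hot = np.zeros(len(classes))
        one_hot[classes.index(label)] = 1.0           # one-hot encoding of the class
        data.append((img, one_hot))
    random.shuffle(data)                              # shuffle the entire dataset
    return data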
Splitting the Dataset
• Split our data set into train and test
• We truncate our sets to a multiple of batch size (all batches
have to be the same size)
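A sketch, assuming the shuffled (image, one-hot) list from above and an illustrative 80/20 split:

split = int(len(data) * 0.8)
train, test = data[:split], data[split:]
train = train[:len(train) // batch_size * batch_size]  # truncate to a multiple of batch size
test = test[:len(test) // batch_size * batch_size]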
Training Loop
• Iterate over each batch and train on it
• (we assume training examples are a multiple of the batch size)
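A sketch of the loop, assuming the placeholders and train_step defined above (and an InteractiveSession, as in the DeepTesla code):

for epoch in range(max_epochs):
    for i in range(0, len(train), batch_size):
        batch = train[i:i + batch_size]
        xs = [img for img, _ in batch]
        ys = [label for _, label in batch]
        train_step.run(feed_dict={x: xs, y_: ys})  # one gradient step per batch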
Best Model
• We evaluate the loss on all of our training examples and test
examples
• If the validation loss is lower than the lowest loss, we save our
model
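A sketch of that check (best_loss starts at infinity; test_xs/test_ys and the checkpoint path are hypothetical names):

val_loss = cross_entropy.eval(feed_dict={x: test_xs, y_: test_ys})
if val_loss < best_loss:  # save only when validation loss improves
    best_loss = val_loss
    saver.save(sess, 'traffic_light_model.ckpt')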
Expected Output
References
All references cited in this presentation are listed in the
following Google Sheets file:
https://goo.gl/9Xhp2t
