You Only Look Once (YOLO): Unified,
Real-Time Object Detection
Presenter: Shivang Singh
Sept 2nd, 2021
CS391R: Robot Learning (Fall 2021) 1
Problem Addressed: Object Detection
❖ Object detection is the problem of both locating AND classifying objects
❖ The goal of the YOLO algorithm is to do object detection both fast AND with high accuracy
Figure: Object Detection vs Classification, from "Deep Learning for Vision Systems" (Elgendy)
CS391R: Robot Learning (Fall 2021) 2
Importance of Object Detection for Robotics
❖ The visual modality is very powerful
❖ Humans are able to detect objects and do perception in real time using just this modality (no radar needed)
❖ If we want responsive robot systems that work in real time without specialized sensors, almost-real-time vision-based object detection can help greatly
Figure: Vision-based vs. LIDAR (self-driving), from the Tesla Investor Day Presentation
CS391R: Robot Learning (Fall 2021) 3
Previous Object Detection Paradigm
This pipeline was used in nearly all prior SOTA object detection:
Step 1: Scan the image to generate candidate bounding boxes
Step 2: Run each bounding box through a classifier, which outputs a label + confidence (e.g., hat: 0.92, racket: 0.2, ball: 0.23)
Step 3: Conduct post-processing (filtering out redundant bounding boxes)
Diagram developed by presenter
CS391R: Robot Learning (Fall 2021) 4
Key Insights
Previous approaches:
❖ A separate model for generating bounding boxes and for classification (more complicated model pipeline)
❖ Need to run classification many times (expensive computation)
❖ Look at a limited part of the image (lack contextual information for detection)
YOLO algorithm:
❖ A single neural network for both localization and classification (less complicated pipeline)
❖ Needs to run inference only once (efficient computation)
❖ Looks at the entire image each time, leading to fewer false positives (has contextual information for detection)
CS391R: Robot Learning (Fall 2021) 5
Formal Problem Setting
❖ Given an image, generate bounding boxes, one for each detectable object in the image
❖ For each bounding box, output 5 predictions: x, y, w, h, confidence. Also output a class
➢ x, y: coordinates of the center of the bounding box
➢ w, h: width and height
➢ confidence: probability that the bounding box contains an object
➢ class: classification of the object in the bounding box
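As a concrete reading of this output format, here is a minimal sketch of one prediction record (the field names are illustrative, not the paper's; in the paper, confidence is actually defined as Pr(Object) times the IOU between the predicted and true box):

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One predicted bounding box (illustrative field names)."""
    x: float           # x-coordinate of the box center
    y: float           # y-coordinate of the box center
    w: float           # box width
    h: float           # box height
    confidence: float  # probability the box contains an object
    class_id: int      # index of the predicted class
```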
CS391R: Robot Learning (Fall 2021) 6
Related Work
- R-CNN, or Region-Based Convolutional Network (Girshick et al., 2014):
  - Uses the sliding-window approach from earlier, with Selective Search, a smarter way to select candidates (which means less computation)
  - Still feeds a limited part of the image to the classifier
  - Drawbacks: large pipeline, slow, too many false positives
- Fast and Faster R-CNN:
  - Optimize parts of the pipeline described earlier
  - Drawbacks: lose accuracy
- Deep MultiBox (Szegedy et al., 2014):
  - Trains a CNN to find areas of interest
  - Drawbacks: addresses only localization, not classification
CS391R: Robot Learning (Fall 2021) 7
Related Work
- MultiGrasp (Redmon et al., 2014)
  - Similar to YOLO
  - A much simpler task (only needs to predict a single region, not multiple objects)
CS391R: Robot Learning (Fall 2021) 8
YOLO overview
❖ First, the image is split into an S×S grid
❖ For each grid cell, generate B bounding boxes
❖ For each bounding box, there are 5 predictions: x, y, w, h, confidence
Illustration: S = 3, B = 2
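A minimal sketch of the grid assignment this implies: the cell containing a box's center is the one responsible for it (the function name and normalized-coordinate convention are assumptions for illustration):

```python
def responsible_cell(x_center: float, y_center: float, S: int) -> tuple[int, int]:
    """Return the (row, col) of the S x S grid cell that a box center
    falls into, given center coordinates normalized to [0, 1)."""
    col = min(int(x_center * S), S - 1)
    row = min(int(y_center * S), S - 1)
    return row, col

print(responsible_cell(0.52, 0.31, S=3))  # -> (0, 1) on the 3x3 grid above
```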
CS391R: Robot Learning (Fall 2021) 9
YOLO Training
❖ YOLO is a regression algorithm. What is X? What is Y?
❖ X is simple: just an image, width (in pixels) × height (in pixels) × RGB values
❖ Y is a tensor of size S × S × (B·5 + C)
❖ The B·5 + C term represents the box predictions plus the predicted class distribution for a grid cell
GT label example: for each grid cell, we have a vector like this (in this example, B is 2 and C is 2)
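A minimal sketch of how Y might be laid out, assuming each cell's vector stores its B box predictions first and the C class probabilities last (the exact slot order is an implementation choice, not specified on the slide):

```python
import numpy as np

S, B, C = 3, 2, 2   # the toy settings from this slide

# Ground-truth tensor for one image: one (B*5 + C)-vector per grid cell.
y = np.zeros((S, S, B * 5 + C), dtype=np.float32)

# Hypothetical object of class 1, centered at normalized image
# coordinates (0.52, 0.31), with width 0.4 and height 0.6.
row, col = int(0.31 * S), int(0.52 * S)
x_cell = 0.52 * S - col                 # center offset within the cell
y_cell = 0.31 * S - row
y[row, col, 0:5] = [x_cell, y_cell, 0.4, 0.6, 1.0]  # first box slot, confidence 1
y[row, col, B * 5 + 1] = 1.0                        # one-hot class distribution
```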
CS391R: Robot Learning (Fall 2021) 10
YOLO Architecture
- Now that we know the input and output, we can discuss the model
- We are given 448 by 448 by 3 as our input
- The implementation uses 7 convolution layers
- Paper parameters: S = 7, B = 2, C = 20
- Output is S×S×(5B + C) = 7×7×(5·2 + 20) = 7×7×30
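To make the shapes concrete, here is a toy stand-in in PyTorch. This is not the paper's network, just about the smallest model with the same input and output dimensions:

```python
import torch
import torch.nn as nn

S, B, C = 7, 2, 20

# Toy stand-in for the real backbone: strided convolutions reduce
# 448x448 down to SxS, then a fully connected head emits the
# S*S*(5B + C) prediction tensor.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.LeakyReLU(0.1),    # 448 -> 224
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.1),   # 224 -> 112
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.1),   # 112 -> 56
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.LeakyReLU(0.1),  # 56 -> 28
    nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.LeakyReLU(0.1), # 28 -> 14
    nn.Conv2d(256, 256, 3, stride=2, padding=1), nn.LeakyReLU(0.1), # 14 -> 7
    nn.Flatten(),
    nn.Linear(256 * S * S, S * S * (5 * B + C)),
)

out = model(torch.zeros(1, 3, 448, 448)).view(1, S, S, 5 * B + C)
print(out.shape)  # torch.Size([1, 7, 7, 30])
```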
CS391R: Robot Learning (Fall 2021) 11
YOLO Prediction
❖ We then use the output to make final detections
❖ Use a threshold to filter out bounding boxes with low P(Object)
❖ To determine the class for a bounding box, compute the score by taking the argmax over the distribution Pr(Class | Object) for the grid cell the bounding box's center is in
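A sketch of this decoding step, assuming the per-cell layout used in the training sketch above (boxes first, class distribution last; the 0.25 threshold is just an illustrative value):

```python
import numpy as np

S, B, C = 7, 2, 20
pred = np.random.rand(S, S, B * 5 + C)        # stand-in for a network output

boxes = pred[..., :B * 5].reshape(S, S, B, 5) # (x, y, w, h, conf) per box
class_probs = pred[..., B * 5:]               # Pr(Class | Object) per cell

conf = boxes[..., 4]                          # P(Object), shape (S, S, B)
best_class = class_probs.argmax(axis=-1)      # one class per grid cell

# class-specific score: Pr(Class | Object) * P(Object)
score = conf * class_probs.max(axis=-1, keepdims=True)

keep = conf > 0.25                            # drop boxes with low P(Object)
```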
CS391R: Robot Learning (Fall 2021) 12
Non-maximal suppression
❖ Most of the time objects fall in one grid cell; however, it is still possible to get redundant boxes (a rare case, as an object must be close to multiple grid cells for this to happen)
❖ Discard bounding boxes with high overlap (keeping the bounding box with the highest confidence)
❖ Adds 2-3% to the final mAP score
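A minimal greedy NMS sketch (the IoU helper and the 0.5 threshold are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-confidence box, discard remaining boxes that
    overlap it above the threshold, and repeat on what is left."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```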
CS391R: Robot Learning (Fall 2021) 13
YOLO Objective Function
❖ For YOLO, we need to minimize the following loss
❖ Sum-squared error is used
❖ Coordinate loss: minimize the difference between the predicted x, y, w, h and the ground-truth x, y, w, h, ONLY IF an object exists in the grid cell and the bounding box is responsible for the prediction
❖ Confidence loss: loss based on confidence, ONLY IF there is an object
❖ No-object loss: loss based on confidence if there is no object
❖ Class loss: minimize the loss between the predicted class distribution and the true class of the object in the grid cell
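For reference, the full sum-squared loss from the paper, where 1_ij^obj is 1 when box j of cell i is responsible for an object; the paper sets lambda_coord = 5 and lambda_noobj = 0.5, and the square roots on w and h soften the penalty on large boxes:

```latex
\begin{aligned}
\mathcal{L} ={} & \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
& + \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
& + \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left( C_i - \hat{C}_i \right)^2
  + \lambda_{\mathrm{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\mathrm{obj}} \sum_{c \in \mathrm{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```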
CS391R: Robot Learning (Fall 2021) 14
Experimental Setup
❖ The authors compare YOLO against the previous work described above on PASCAL VOC 2007 and VOC 2012, as well as on an out-of-domain art dataset
❖ A detection is correct if the IoU metric is above 0.5 and the class is correct
❖ Two performance metrics are used:
➢ mAP score: mean average precision
➢ FPS: frames per second
❖ They also add Fast YOLO, which has fewer parameters
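A correctness check under this rule might look like the following sketch (reusing the iou helper from the NMS slide; names are illustrative):

```python
def is_correct(pred_box, pred_class, gt_box, gt_class, iou_thresh=0.5):
    """PASCAL VOC-style hit test: class must match and IoU must exceed 0.5."""
    return pred_class == gt_class and iou(pred_box, gt_box) > iou_thresh
```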
CS391R: Robot Learning (Fall 2021) 15
Experimental Results
❖ Baseline YOLO outperforms other real-time detectors by a large margin
❖ It also does better than most less-than-real-time detectors
CS391R: Robot Learning (Fall 2021) 16
Experimental Results
CS391R: Robot Learning (Fall 2021) 17
Experimental Results - Error Analysis
- YOLO makes far fewer background errors, i.e., it is less likely to predict false positives on the background (where IoU is VERY small with any ground-truth label)
- But it makes far more localization errors (correct class, but IoU is somewhat small)
Chart legend: Localization error, Background error
CS391R: Robot Learning (Fall 2021) 18
Experimental Results - Out of Domain
❖ Ran YOLO and competitors (trained on natural images) on art
❖ YOLO does well on artistic datasets, where having global context greatly helps
CS391R: Robot Learning (Fall 2021) 19
Discussion of Results
❖ Pro: YOLO is a lot faster than the other object detection algorithms
❖ Pro: YOLO's use of global information rather than only local information allows it to use contextual information when doing object detection
➢ It does better in domains such as artwork because of this
❖ Con: YOLO lagged behind the SOTA models in object detection accuracy
➢ This is attributed to making many localization errors and being unable to detect small objects
CS391R: Robot Learning (Fall 2021) 20
Critique / Limitations / Open Issues
❖ Performance lags behind SOTA
❖ Requires data labeled with bounding boxes, which is hard to collect for many classes
➢ Previous work could generalize better since it used an image classifier
➢ The 2014 COCO dataset (a very large dataset) addressed this somewhat
❖ Regarding experiments: the number of classes predicted is very limited
➢ Not convinced that YOLO v1 is generalizable
❖ The confidence output of YOLO is not the confidence of the class but P(Object), which lowers interpretability
❖ Another limitation of YOLO is that it imposes spatial constraints on the objects in the image, since only B boxes can be predicted per cell of an S×S grid (see the arithmetic after this list)
❖ Since the architecture only predicts boxes, this might make it less useful for irregular shapes
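To make the spatial constraint concrete, a quick back-of-the-envelope with the paper's settings:

```python
S, B = 7, 2
print(S * S * B)   # at most 98 boxes per image, and only one class per grid cell
```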
CS391R: Robot Learning (Fall 2021) 21
Future Work for Paper / Reading
❖ One extension of this work would be to look at image segmentation and see if the insights carry over
○ YOLACT (Bolya et al., 2019): real-time instance segmentation
❖ YOLO has been upgraded twice since
○ The upgrades solve many of the issues relating to detecting small objects, generalizability, and localization
Image: YOLACT example
CS391R: Robot Learning (Fall 2021) 22
Extended Readings
❖ YOLO v2 (https://arxiv.org/abs/1612.08242) (extends the work greatly) (Redmon et al., 2016)
➢ Deals with the generalizability problem; can detect over 9000 classes
➢ Class probability distribution per bounding box, not per grid cell (see the shape sketch after this list)
➢ High-resolution classifier (fine-tuned at high resolution)
➢ Batch normalization
➢ Trained on MS COCO (released after the YOLO v1 paper)
❖ YOLO v3 (https://arxiv.org/abs/1804.02767)
➢ "Incremental Improvement"
➢ Uses independent logistic classifiers for the classes
■ Allows for more specificity in classes
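A sketch of what the per-box class distribution changes about the output shape (the v2 numbers here are illustrative anchor-box settings, not taken from this slide):

```python
B1, C = 2, 20
v1_cell = B1 * 5 + C        # YOLO v1: 30 values per cell, classes shared by the cell
B2 = 5
v2_cell = B2 * (5 + C)      # YOLO v2: 125 values per cell, classes per bounding box
print(v1_cell, v2_cell)     # -> 30 125
```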
CS391R: Robot Learning (Fall 2021) 23
Summary
❖ Object detection is the problem of detecting multiple objects in an image
❖ Almost-real-time object detection can enable highly responsive robot systems without complex sensors
❖ Prior work relies on a large architecture with numerous parts to optimize
❖ YOLO proposes a unified architecture, which does all the tasks in one model with one inference over the entire image
❖ The authors show an enormous speed improvement and show that YOLO beats most other prior work in terms of mAP
CS391R: Robot Learning (Fall 2021) 24
