You Only Look Once (YOLO): Unified,
Real-Time Object Detection
Presenter: Shivang Singh
Sept 2nd, 2021
CS391R: Robot Learning (Fall 2021) 1
Problem Addressed: Object Detection
❖ Object detection is the problem of both locating AND classifying objects
❖ The goal of the YOLO algorithm is to do object detection both fast AND with high accuracy
Figure: Object Detection vs Classification, from "Deep Learning for Vision Systems" (Elgendy)
CS391R: Robot Learning (Fall 2021) 2
Importance of Object Detection for Robotics
❖ The visual modality is very powerful
❖ Humans are able to detect objects and do perception in real time using just this modality (no radar needed)
❖ If we want responsive robot systems that work in real time without specialized sensors, almost-real-time vision-based object detection can help greatly
Figure: Vision-based vs. LIDAR (self-driving), from the Tesla Investor Day Presentation
CS391R: Robot Learning (Fall 2021) 3
Previous Object Detection Paradigm
This pipeline was used in nearly all prior SOTA object detection:
Step 1: Scan the image to generate candidate bounding boxes
Step 2: Run each bounding box through a classifier, which outputs a label + confidence (e.g., hat: 0.92, racket: 0.2, ball: 0.23)
Step 3: Conduct post-processing (filtering out redundant bounding boxes)
Diagram developed by presenter
CS391R: Robot Learning (Fall 2021) 4
Key Insights
Previous approaches:
❖ A separate model for generating bounding boxes and for classification (more complicated model pipeline)
❖ Need to run classification many times (expensive computation)
❖ Look at a limited part of the image (lack contextual information for detection)
YOLO algorithm:
❖ A single neural network for both localization and classification (less complicated pipeline)
❖ Needs to run inference only once (efficient computation)
❖ Looks at the entire image each time, leading to fewer false positives (has contextual information for detection)
CS391R: Robot Learning (Fall 2021) 5
Formal Problem Setting
❖ Given an image, generate bounding boxes, one for each detectable object in the image
❖ For each bounding box, output 5 predictions: x, y, w, h, confidence. Also output a class
➢ x, y: coordinates of the center of the bounding box
➢ w, h: width and height
➢ confidence: probability that the bounding box contains an object
➢ class: classification of the object in the bounding box
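As a concrete reading of this output format, here is a minimal sketch of one prediction record (the field names are illustrative, not the paper's; in the paper, confidence is actually defined as Pr(Object) times the IOU between the predicted and true box):

```python
from dataclasses import dataclass

@dataclass
class Detection:
    """One predicted bounding box (illustrative field names)."""
    x: float           # x-coordinate of the box center
    y: float           # y-coordinate of the box center
    w: float           # box width
    h: float           # box height
    confidence: float  # probability the box contains an object
    class_id: int      # index of the predicted class
```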
CS391R: Robot Learning (Fall 2021) 6
Related Work
- R-CNN, or Region-Based Convolutional Network (Girshick et al., 2014):
  - Uses the sliding-window approach from earlier, with Selective Search, a smarter way to select candidates (which means less computation)
  - Still feeds a limited part of the image to the classifier
  - Drawbacks: large pipeline, slow, too many false positives
- Fast and Faster R-CNN:
  - Optimize parts of the pipeline described earlier
  - Drawbacks: lose accuracy
- Deep MultiBox (Szegedy et al., 2014):
  - Trains a CNN to find areas of interest
  - Drawbacks: addresses only localization, not classification
CS391R: Robot Learning (Fall 2021) 7
Related Work
- MultiGrasp (Redmon et al., 2014)
  - Similar to YOLO
  - A much simpler task (only needs to predict a single region, not multiple objects)
CS391R: Robot Learning (Fall 2021) 8
YOLO overview
❖ First, the image is split into an S×S grid
❖ For each grid cell, generate B bounding boxes
❖ For each bounding box, there are 5 predictions: x, y, w, h, confidence
Illustration: S = 3, B = 2
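A minimal sketch of the grid assignment this implies: the cell containing a box's center is the one responsible for it (the function name and normalized-coordinate convention are assumptions for illustration):

```python
def responsible_cell(x_center: float, y_center: float, S: int) -> tuple[int, int]:
    """Return the (row, col) of the S x S grid cell that a box center
    falls into, given center coordinates normalized to [0, 1)."""
    col = min(int(x_center * S), S - 1)
    row = min(int(y_center * S), S - 1)
    return row, col

print(responsible_cell(0.52, 0.31, S=3))  # -> (0, 1) on the 3x3 grid above
```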
CS391R: Robot Learning (Fall 2021) 9
YOLO Training
❖ YOLO is a regression algorithm. What is X? What is Y?
❖ X is simple: just an image, width (in pixels) × height (in pixels) × RGB values
❖ Y is a tensor of size S × S × (B·5 + C)
❖ The B·5 + C term represents the box predictions plus the predicted class distribution for a grid cell
GT label example: for each grid cell, we have a vector like this (in this example, B is 2 and C is 2)
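A minimal sketch of how Y might be laid out, assuming each cell's vector stores its B box predictions first and the C class probabilities last (the exact slot order is an implementation choice, not specified on the slide):

```python
import numpy as np

S, B, C = 3, 2, 2   # the toy settings from this slide

# Ground-truth tensor for one image: one (B*5 + C)-vector per grid cell.
y = np.zeros((S, S, B * 5 + C), dtype=np.float32)

# Hypothetical object of class 1, centered at normalized image
# coordinates (0.52, 0.31), with width 0.4 and height 0.6.
row, col = int(0.31 * S), int(0.52 * S)
x_cell = 0.52 * S - col                 # center offset within the cell
y_cell = 0.31 * S - row
y[row, col, 0:5] = [x_cell, y_cell, 0.4, 0.6, 1.0]  # first box slot, confidence 1
y[row, col, B * 5 + 1] = 1.0                        # one-hot class distribution
```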
CS391R: Robot Learning (Fall 2021) 10
YOLO Architecture
- Now that we know the input and output, we can discuss the model
- We are given 448 by 448 by 3 as our input
- The implementation uses 7 convolution layers
- Paper parameters: S = 7, B = 2, C = 20
- Output is S×S×(5B + C) = 7×7×(5·2 + 20) = 7×7×30
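To make the shapes concrete, here is a toy stand-in in PyTorch. This is not the paper's network, just about the smallest model with the same input and output dimensions:

```python
import torch
import torch.nn as nn

S, B, C = 7, 2, 20

# Toy stand-in for the real backbone: strided convolutions reduce
# 448x448 down to SxS, then a fully connected head emits the
# S*S*(5B + C) prediction tensor.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.LeakyReLU(0.1),    # 448 -> 224
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.LeakyReLU(0.1),   # 224 -> 112
    nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.LeakyReLU(0.1),   # 112 -> 56
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.LeakyReLU(0.1),  # 56 -> 28
    nn.Conv2d(128, 256, 3, stride=2, padding=1), nn.LeakyReLU(0.1), # 28 -> 14
    nn.Conv2d(256, 256, 3, stride=2, padding=1), nn.LeakyReLU(0.1), # 14 -> 7
    nn.Flatten(),
    nn.Linear(256 * S * S, S * S * (5 * B + C)),
)

out = model(torch.zeros(1, 3, 448, 448)).view(1, S, S, 5 * B + C)
print(out.shape)  # torch.Size([1, 7, 7, 30])
```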
CS391R: Robot Learning (Fall 2021) 11
YOLO Prediction
❖ We then use the output to make final detections
❖ Use a threshold to filter out bounding boxes with low P(Object)
❖ To determine the class for a bounding box, compute the score by taking the argmax over the distribution Pr(Class | Object) for the grid cell the bounding box's center is in
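A sketch of this decoding step, assuming the per-cell layout used in the training sketch above (boxes first, class distribution last; the 0.25 threshold is just an illustrative value):

```python
import numpy as np

S, B, C = 7, 2, 20
pred = np.random.rand(S, S, B * 5 + C)        # stand-in for a network output

boxes = pred[..., :B * 5].reshape(S, S, B, 5) # (x, y, w, h, conf) per box
class_probs = pred[..., B * 5:]               # Pr(Class | Object) per cell

conf = boxes[..., 4]                          # P(Object), shape (S, S, B)
best_class = class_probs.argmax(axis=-1)      # one class per grid cell

# class-specific score: Pr(Class | Object) * P(Object)
score = conf * class_probs.max(axis=-1, keepdims=True)

keep = conf > 0.25                            # drop boxes with low P(Object)
```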
CS391R: Robot Learning (Fall 2021) 12
Non-maximal suppression
❖ Most of the time objects fall in one grid cell; however, it is still possible to get redundant boxes (a rare case, as an object must be close to multiple grid cells for this to happen)
❖ Discard bounding boxes with high overlap (keeping the bounding box with the highest confidence)
❖ Adds 2-3% to the final mAP score
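A minimal greedy NMS sketch (the IoU helper and the 0.5 threshold are illustrative assumptions, not the paper's exact procedure):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    """Keep the highest-confidence box, discard remaining boxes that
    overlap it above the threshold, and repeat on what is left."""
    order = list(np.argsort(scores)[::-1])
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```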
CS391R: Robot Learning (Fall 2021) 13
YOLO Objective Function
❖ For YOLO, we need to minimize the following loss
❖ Sum-squared error is used
❖ Coordinate loss: minimize the difference between the predicted x, y, w, h and the ground-truth x, y, w, h, ONLY IF an object exists in the grid cell and the bounding box is responsible for the prediction
❖ Confidence loss: loss based on confidence, ONLY IF there is an object
❖ No-object loss: loss based on confidence if there is no object
❖ Class loss: minimize the loss between the predicted class distribution and the true class of the object in the grid cell
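For reference, the full sum-squared loss from the paper, where 1_ij^obj is 1 when box j of cell i is responsible for an object; the paper sets lambda_coord = 5 and lambda_noobj = 0.5, and the square roots on w and h soften the penalty on large boxes:

```latex
\begin{aligned}
\mathcal{L} ={} & \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
& + \lambda_{\mathrm{coord}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
& + \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{obj}} \left( C_i - \hat{C}_i \right)^2
  + \lambda_{\mathrm{noobj}} \sum_{i=0}^{S^2}\sum_{j=0}^{B} \mathbb{1}_{ij}^{\mathrm{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
& + \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\mathrm{obj}} \sum_{c \in \mathrm{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```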
CS391R: Robot Learning (Fall 2021) 14
Experimental Setup
❖ The authors compare YOLO against the previous work described above on PASCAL VOC 2007 and VOC 2012, as well as on an out-of-domain art dataset
❖ A detection is correct if the IoU metric is above 0.5 and the class is correct
❖ Two performance metrics are used:
➢ mAP score: mean average precision
➢ FPS: frames per second
❖ They also add Fast YOLO, which has fewer parameters
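A correctness check under this rule might look like the following sketch (reusing the iou helper from the NMS slide; names are illustrative):

```python
def is_correct(pred_box, pred_class, gt_box, gt_class, iou_thresh=0.5):
    """PASCAL VOC-style hit test: class must match and IoU must exceed 0.5."""
    return pred_class == gt_class and iou(pred_box, gt_box) > iou_thresh
```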
CS391R: Robot Learning (Fall 2021) 15
Experimental Results
❖ Baseline YOLO outperforms other real-time detectors by a large margin
❖ It also does better than most less-than-real-time detectors
CS391R: Robot Learning (Fall 2021) 16
Experimental Results
CS391R: Robot Learning (Fall 2021) 17
Experimental Results - Error Analysis
- YOLO makes far fewer background errors, i.e., it is less likely to predict false positives on the background (where IoU is VERY small with any ground-truth label)
- But it makes far more localization errors (correct class, but IoU is somewhat small)
Chart legend: Localization error, Background error
CS391R: Robot Learning (Fall 2021) 18
Experimental Results - Out of Domain
❖ Ran YOLO and competitors (trained on natural images) on art
❖ YOLO does well on artistic datasets, where having global context greatly helps
CS391R: Robot Learning (Fall 2021) 19
Discussion of Results
❖ Pro: YOLO is a lot faster than the other object detection algorithms
❖ Pro: YOLO's use of global information rather than only local information allows it to use contextual information when doing object detection
➢ It does better in domains such as artwork because of this
❖ Con: YOLO lagged behind the SOTA models in object detection accuracy
➢ This is attributed to making many localization errors and being unable to detect small objects
CS391R: Robot Learning (Fall 2021) 20
Critique / Limitations / Open Issues
❖ Performance lags behind SOTA
❖ Requires data labeled with bounding boxes, which is hard to collect for many classes
➢ Previous work could generalize better since it used an image classifier
➢ The 2014 COCO dataset (a very large dataset) addressed this somewhat
❖ Regarding experiments: the number of classes predicted is very limited
➢ Not convinced that YOLO v1 is generalizable
❖ The confidence output of YOLO is not the confidence of the class but P(Object), which lowers interpretability
❖ Another limitation of YOLO is that it imposes spatial constraints on the objects in the image, since only B boxes can be predicted per cell of an S×S grid (see the arithmetic after this list)
❖ Since the architecture only predicts boxes, this might make it less useful for irregular shapes
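To make the spatial constraint concrete, a quick back-of-the-envelope with the paper's settings:

```python
S, B = 7, 2
print(S * S * B)   # at most 98 boxes per image, and only one class per grid cell
```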
CS391R: Robot Learning (Fall 2021) 21
Future Work for Paper / Reading
❖ One extension of this work would be to look at image segmentation and see if the insights carry over
○ YOLACT (Bolya et al., 2019): real-time instance segmentation
❖ YOLO has been upgraded twice since
○ The upgrades solve many of the issues relating to detecting small objects, generalizability, and localization
Image: YOLACT example
CS391R: Robot Learning (Fall 2021) 22
Extended Readings
❖ YOLO v2 (https://arxiv.org/abs/1612.08242) (extends the work greatly) (Redmon et al., 2016)
➢ Deals with the generalizability problem; can detect over 9000 classes
➢ Class probability distribution per bounding box, not per grid cell (see the shape sketch after this list)
➢ High-resolution classifier (fine-tuned at high resolution)
➢ Batch normalization
➢ Trained on MS COCO (released after the YOLO v1 paper)
❖ YOLO v3 (https://arxiv.org/abs/1804.02767)
➢ "Incremental Improvement"
➢ Uses independent logistic classifiers for the classes
■ Allows for more specificity in classes
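A sketch of what the per-box class distribution changes about the output shape (the v2 numbers here are illustrative anchor-box settings, not taken from this slide):

```python
B1, C = 2, 20
v1_cell = B1 * 5 + C        # YOLO v1: 30 values per cell, classes shared by the cell
B2 = 5
v2_cell = B2 * (5 + C)      # YOLO v2: 125 values per cell, classes per bounding box
print(v1_cell, v2_cell)     # -> 30 125
```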
CS391R: Robot Learning (Fall 2021) 23
Summary
❖ Object detection is the problem of detecting multiple objects in an image
❖ Almost-real-time object detection can enable highly responsive robot systems without complex sensors
❖ Prior work relies on a large architecture with numerous parts to optimize
❖ YOLO proposes a unified architecture, which does all the tasks in one model with one inference over the entire image
❖ The authors show an enormous speed improvement and show that YOLO beats most other prior work in terms of mAP
CS391R: Robot Learning (Fall 2021) 24
