You Only Look Once (YOLO):
Unified Real-Time Object Detection
Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi
University of Washington, Allen Institute for AI, Facebook AI Research
~ Ashish
Previously : Object Detection by Classifiers
● DPM (Deformable Parts Model)
○ Sliding window → classifier (evenly spaced locations)
● R-CNN
○ Region proposals → potential BB
○ Run a classifier on each BB
○ Post-processing (refine boxes, eliminate duplicates, rescore)
● YOLO
○ Resize image, run convolutional network, non-max suppression
YOLO : Object Detection as Regression Problem
● output: Bounding box coordinates and Class Probabilities
● Single Neural Network
● Benefits:
○ Extremely fast (a single NN running at 45 frames per sec), with more than twice the mAP of other real-time systems
○ Global reasoning (sees the whole image, so it knows context and makes fewer background errors)
○ Generalizable representations (trained on natural images, tested on artwork; transfers to new domains)
Unified Detection
● Feature Extraction
○ Uses features from the entire image to predict all BB across all classes simultaneously
● SxS Grid
○ Each cell predicts B bounding boxes + Confidence Score
● Confidence Score
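○ Confidence = P(object) x IOU between the predicted box and the ground truth box (zero if the cell contains no object)
● Class Probability
○ Each cell predicts C conditional class probabilities, P(class|object)
● Tensor
○ All predictions are encoded as one S x S x (B x 5 + C) tensor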
Detection Process (YOLO): S x S Grid
S = 7
Confidence Score
Each grid cell predicts B bounding boxes and a confidence score for each of those boxes.
If a cell contains an object, the confidence score equals the intersection over union (IOU)
between the predicted box and the ground truth; otherwise it should be zero.
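The IOU used in the confidence score can be computed as in the minimal sketch below. The corner-based box format (x_min, y_min, x_max, y_max) and the function name are assumptions made for illustration; YOLO itself parameterises boxes as (x, y, w, h).

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max); this corner format is an
    assumption of this sketch, not the YOLO box encoding.
    """
    # Corners of the overlapping rectangle (if any).
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])

    # Clamp to zero when the boxes do not overlap.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

Identical boxes give an IOU of 1, and boxes that do not overlap give 0.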
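Detection Process (YOLO)
Each cell predicts B boxes (x, y, w, h) and a confidence for each box: P(Object)
[Figure: a predicted box with center (x, y), width w, and height h. With B = 2, each cell outputs two boxes with object probabilities P1 and P2; cells without an object are labeled "No Object".]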
Each cell predicts Bounding Boxes and Confidence
[Figure: predicted bounding boxes with centers (x, y).]
Each cell also predicts class probabilities
Classes in the example: Bicycle, Dog, Car
E.g. Dog: 0.8, Car: 0, Bicycle: 0
E.g. Dog: 0, Car: 0, Bicycle: 0.7
E.g. Dog: 0, Car: 0.7, Bicycle: 0
Bounding Boxes + Class Prediction
P(class) = P(class|object) x P(object)
Thresholding: boxes whose class score falls below a threshold are discarded
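The scoring and thresholding step for a single cell might look like the sketch below. The function name, the dictionary input, and the threshold of 0.2 are illustrative assumptions; the slides do not state a threshold value.

```python
def class_scores(class_probs, box_confidences, threshold=0.2):
    """Combine one cell's class probabilities with its box confidences.

    class_probs:     dict mapping class name -> P(class|object) for the cell
    box_confidences: list with one confidence per predicted box (length B)
    threshold:       hypothetical cut-off, chosen only for this example

    Returns a list of (box_index, class_name, score) for scores >= threshold.
    """
    detections = []
    for box_index, confidence in enumerate(box_confidences):
        for name, prob in class_probs.items():
            score = prob * confidence  # P(class|object) x box confidence
            if score >= threshold:
                detections.append((box_index, name, score))
    return detections


# Example cell from the slides: the dog class has probability 0.8.
cell_probs = {"dog": 0.8, "car": 0.0, "bicycle": 0.0}
print(class_scores(cell_probs, [0.9, 0.1]))
```

With these inputs only the first box survives, because the second box's low confidence pulls its dog score below the threshold.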
Model
These predictions are encoded
as a tensor of dimension
S x S x (B x 5 + C), where
S x S = grid size,
B = number of bounding boxes per cell (each with x, y, w, h, confidence),
C = number of classes.
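As a quick check of the tensor size, the sketch below computes the output shape and splits one cell's vector into boxes and class probabilities. C = 20 is the PASCAL VOC class count. The ordering inside the vector (boxes first, then classes) is an assumption of this sketch, and real implementations may lay the values out differently.

```python
def output_shape(S, B, C):
    """Shape of the prediction tensor: S x S cells, B boxes of 5 values, C classes."""
    return (S, S, B * 5 + C)


def split_cell(cell_vector, B, C):
    """Split one cell's prediction vector into boxes and class probabilities.

    Assumed layout (hypothetical): B groups of (x, y, w, h, confidence),
    followed by C class probabilities.
    """
    if len(cell_vector) != B * 5 + C:
        raise ValueError("cell vector has the wrong length")
    boxes = [tuple(cell_vector[5 * b:5 * b + 5]) for b in range(B)]
    class_probs = list(cell_vector[5 * B:])
    return boxes, class_probs


print(output_shape(7, 2, 20))  # (7, 7, 30)
print(7 * 7 * 2)               # 98 boxes per image
```

With S = 7 and B = 2 the network emits 7 x 7 x 2 = 98 boxes per image.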
Network Design
● Inspired by GoogLeNet (an image classification model)
● 24 convolutional layers followed by 2 fully connected layers
● Fast YOLO uses 9 convolutional layers (instead of 24)
Training
1. Pretrain the convolutional layers on the ImageNet 1000-class dataset
2. Pretraining network: the first 20 convolutional layers + an average pooling layer + a fully connected layer
3. Trained for about 1 week, reaching 88% top-5 accuracy (ImageNet 2012 validation set)
4. Convert the model to perform detection
5. Add 4 convolutional layers + 2 fully connected layers and increase the input resolution from 224 x 224 to
448 x 448.
6. Final layer predicts class probabilities + BB coordinates.
7. Linear activation function (final layer), leaky ReLU (all other layers)
8. Sum-squared error as the loss function (easy to optimise)
Loss Function
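The equation on this slide did not survive extraction. For reference, the multi-part sum-squared loss given in the YOLO paper can be written as below. The notation is taken from the paper rather than from these slides: hatted symbols are predictions, the indicator 1^obj_ij selects the j-th box in cell i that is responsible for an object, C_i is the box confidence, p_i(c) is the class probability, and the paper sets lambda_coord = 5 and lambda_noobj = 0.5.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{obj}}
      \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
 &+ \lambda_{\text{coord}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{obj}}
      \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2
           + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
 &+ \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{obj}} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \lambda_{\text{noobj}} \sum_{i=0}^{S^2} \sum_{j=0}^{B}
      \mathbb{1}_{ij}^{\text{noobj}} \left( C_i - \hat{C}_i \right)^2 \\
 &+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{\text{obj}}
      \sum_{c \,\in\, \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
```

The square roots on w and h reduce the weight of errors in large boxes relative to small ones, and lambda_noobj down-weights the many cells that contain no object.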
Training - Validation
1. Train the network for about 135 epochs on the training and validation data sets from PASCAL
VOC 2007 and 2012
2. Testing data: VOC 2007 & 2012
3. Batch size = 64, momentum = 0.9, decay = 0.0005
4. Learning rate:
a. First few epochs: slowly raise LR from 10^-3 to 10^-2
b. The model often diverges if the starting LR is high, due to unstable gradients
c. First 75 epochs: LR 10^-2
d. Next 30 epochs: LR 10^-3
e. Final 30 epochs: LR 10^-4
5. To avoid overfitting:
a. Dropout layer with rate 0.5
b. Data augmentation: random scaling and translation of up to 20% of the original image size
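The learning rate schedule in item 4 can be sketched as a simple function of the epoch index. Two details are assumptions of this sketch, since the slides do not specify them: the warm-up lasts 5 epochs with a linear ramp, and it is counted inside the first 75 epochs.

```python
def learning_rate(epoch, warmup_epochs=5, total_epochs=135):
    """Learning rate for a zero-based epoch index.

    warmup_epochs=5 and the linear ramp are hypothetical choices; the
    source only says the rate is raised over the "first few epochs".
    """
    if epoch < 0 or epoch >= total_epochs:
        raise ValueError("epoch out of range")
    if epoch < warmup_epochs:
        # Ramp linearly from 1e-3 towards 1e-2 during warm-up.
        return 1e-3 + (1e-2 - 1e-3) * epoch / warmup_epochs
    if epoch < 75:
        return 1e-2  # first 75 epochs (warm-up included, by assumption)
    if epoch < 105:
        return 1e-3  # next 30 epochs
    return 1e-4      # final 30 epochs


for e in (0, 10, 80, 120):
    print(e, learning_rate(e))
```

Starting low and ramping up avoids the divergence mentioned in item 4b while still spending most of training at the higher rate.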
Inference
● On PASCAL VOC, YOLO predicts 98 BB per image (7 x 7 cells x 2 boxes) and class probabilities for
each box.
● Large objects, or objects near the border of multiple cells, can be localised by multiple cells
○ Non-maximal suppression can be used to fix these multiple detections (it keeps the
highest-scoring box and discards lower-scoring boxes that overlap it heavily)
■ Adds 2 to 3% to mAP
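A minimal sketch of greedy non-maximal suppression for the detections of one class is shown below. The corner-based box format, the helper names, and the IOU threshold of 0.5 are assumptions made for illustration; the slides do not give a threshold.

```python
def iou(a, b):
    """IOU of two boxes given as (x_min, y_min, x_max, y_max)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS over a list of (box, score) pairs for a single class.

    iou_threshold=0.5 is a hypothetical value chosen for this example.
    """
    # Visit detections from the highest score to the lowest.
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        remaining = [d for d in remaining if iou(best[0], d[0]) <= iou_threshold]
    return kept


dets = [((0, 0, 10, 10), 0.9), ((1, 1, 11, 11), 0.8), ((20, 20, 30, 30), 0.7)]
print(non_max_suppression(dets))
```

In this example the second box overlaps the first with an IOU of about 0.68, so it is suppressed, while the distant third box is kept.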
Limitation of YOLO
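● Struggles with small objects, especially ones that appear in groups
● Struggles with objects in new or unusual aspect ratios or configurations
● Loss function treats errors the same in small and large boxes, although a small error in a small box has a much greater effect on IOU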
Comparison with Other Detection Systems:
● DPM: disjoint pipeline (sliding window, features, classify, predict BB);
YOLO performs these steps concurrently in a single network
● R-CNN: region proposals and a complex pipeline (predict BB, extract
features, non-max suppression); more than 40 sec per image with about 2000 BB,
versus 98 BB for YOLO
● Deep MultiBox: CNN, cannot perform general object detection
● OverFeat: CNN, disjoint system, no global context
● MultiGrasp: similar in design to YOLO, but only needs to find a single graspable region
Experiments
● PASCAL VOC 2007
● Real-time:
○ YOLO vs. DPM (30 Hz)
VOC 2007 Error Analysis
Combining Fast R-CNN and YOLO
● YOLO makes fewer background mistakes than Fast R-CNN
● This combination doesn’t benefit from the speed of YOLO, since each model is run separately and the results are then combined.
VOC 2012 Results
● YOLO struggles with small objects (bottle, sheep, tv/monitor)
● Fast R-CNN + YOLO: one of the highest performing detection methods
Generalizability: Person Detection in Artwork
● YOLO has good performance on VOC 2007
● Its AP degrades less than other methods when applied to artwork.
● Artwork and natural images are very different on a pixel level but very similar in terms of the size and shape of objects, so YOLO still predicts good bounding boxes and detections.
Results
Darknet (YOLO) Results on random images
