End-to-End Object Detection with Transformers
Hwang Seung Hyun
Yonsei University Severance Hospital CCIDS
Facebook AI | MICCAI 2020
2020.10.11
Contents
01 Introduction
02 Related Work
03 Methods and Experiments
04 Conclusion
Yonsei University Severance Hospital CCIDS
DETR
Introduction – Background
• Object detection as set prediction – predicting a set of bounding boxes and class labels.
• Modern detectors address this indirectly – surrogate regression/classification tasks, anchors, non-maximum suppression (NMS), etc.
• Such methods are significantly influenced by these post-processing steps.
Introduction – Proposal
• Proposes a direct set prediction approach to bypass the surrogate tasks.
• Adopts an encoder-decoder architecture based on transformers.
• The self-attention mechanism of transformers, which models all pairwise interactions between elements in a sequence, helps remove duplicate predictions.
• DEtection TRansformer (DETR) predicts all objects at once in an end-to-end manner, with a set loss function that performs bipartite matching between predicted and ground-truth (GT) objects.
• DETR is the conjunction of a bipartite matching loss and transformers with parallel decoding.
[Figure: Overview of the proposed DETR framework]
Introduction – Contribution
• DETR simplifies the detection pipeline by dropping multiple hand-designed components that encode prior knowledge, such as spatial anchors and NMS.
• DETR doesn't require any customized layers and can be reproduced easily in any framework that provides standard CNN and transformer classes.
• DETR extends easily to more complex tasks such as panoptic segmentation.
• DETR demonstrates accuracy and run-time performance on par with the Faster R-CNN baseline on the COCO object detection dataset.
Related Work
Set Prediction
• There is no canonical deep learning model that directly predicts sets.
• Near-duplicates must be avoided → most detectors rely on post-processing such as NMS.
• Direct set prediction needs global inference schemes that model interactions between all predicted elements to avoid redundancy.
• For constant-size set prediction, FCNs or RNNs are typically used.
Related Work
Transformers and Parallel Decoding
• The transformer is an attention-based building block originally introduced for machine translation.
• Attention mechanisms aggregate information from the entire input sequence → self-attention layers.
• Parallel sequence generation was developed in the domains of audio, machine translation, and speech recognition.
Related Work
Object Detection
• Two-stage detectors predict boxes w.r.t. proposals.
• Single-stage methods make predictions w.r.t. anchors or a grid of possible object centers.
• DETR removes this hand-crafted process by directly predicting the set of detections with absolute box predictions w.r.t. the input image.
[Figures: YOLO, R-CNN]
Methods and Experiments
Proposed Framework
• Backbone: a conventional CNN backbone extracts a compact feature representation.
• Transformer Encoder: a 1x1 conv reduces the channel dimension → the feature map is flattened and positional encodings are added to the input of each attention layer → each encoder layer consists of a multi-head self-attention module and an FFN.
• Transformer Decoder: transforms N input embeddings (learnt positional encodings, i.e. object queries) using multi-headed self- and encoder-decoder attention mechanisms. Decodes the N objects in parallel at each decoder layer. Enables global reasoning about all objects using pair-wise relations between them.
• Feed-Forward Network (FFN): a 3-layer perceptron with ReLU activations and a linear projection layer. It predicts the normalized center coordinates, height, and width of the box, and the class label via a softmax function. (A minimal code sketch of this pipeline follows below.)
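Below is a minimal, hedged PyTorch sketch of this pipeline, in the spirit of the simplified DETR snippet. The learnt 2D positional encoding, the query count, and the layer widths are illustrative assumptions, not the exact training configuration.

```python
# Minimal sketch of the DETR pipeline described above (illustrative only:
# the positional-encoding scheme and hyperparameters are assumptions).
import torch
from torch import nn
from torchvision.models import resnet50

class SimpleDETR(nn.Module):
    def __init__(self, num_classes, hidden_dim=256, nheads=8,
                 num_enc=6, num_dec=6, num_queries=100):
        super().__init__()
        # CNN backbone: ResNet-50 without avgpool/fc; output has 2048 channels
        self.backbone = nn.Sequential(*list(resnet50(weights=None).children())[:-2])
        # 1x1 conv reduces the channel dimension to the transformer width
        self.conv = nn.Conv2d(2048, hidden_dim, 1)
        # standard transformer encoder-decoder
        self.transformer = nn.Transformer(hidden_dim, nheads, num_enc, num_dec)
        # prediction heads: class logits (+1 for "no object") and box (cx, cy, w, h)
        self.linear_class = nn.Linear(hidden_dim, num_classes + 1)
        self.linear_bbox = nn.Linear(hidden_dim, 4)
        # N learnt object queries and simple learnt 2D positional encodings
        self.query_pos = nn.Parameter(torch.rand(num_queries, hidden_dim))
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))

    def forward(self, x):
        h = self.conv(self.backbone(x))            # (B, hidden_dim, H, W)
        B, _, H, W = h.shape
        # build an (H*W, 1, hidden_dim) positional encoding and add it to the
        # flattened feature map before feeding the encoder
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)
        src = pos + h.flatten(2).permute(2, 0, 1)  # (H*W, B, hidden_dim)
        tgt = self.query_pos.unsqueeze(1).repeat(1, B, 1)
        out = self.transformer(src, tgt)           # decode N objects in parallel
        return self.linear_class(out), self.linear_bbox(out).sigmoid()
```

Each forward pass returns a fixed-size set of N class-logit vectors and N normalized boxes per image, which the set prediction loss (next slides) then matches against the ground truth.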
Methods and Experiments
Set Prediction Loss
• DETR infers a fixed-size set of N predictions in a single pass through the decoder.
• The loss must score predicted objects against the ground truth – class, position, and size.
• The loss first finds an optimal bipartite matching between predicted and ground-truth objects (a minimal matching sketch follows this list), and then optimizes object-specific (bounding box) losses:

  $\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{i=1}^{N} \mathcal{L}_{\mathrm{match}}(y_i, \hat{y}_{\sigma(i)})$,
  $\mathcal{L}_{\mathrm{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}(b_i, \hat{b}_{\hat{\sigma}(i)}) \right]$

• The Hungarian loss, computed after each decoder layer for all matched pairs, is a linear combination of the negative log-likelihood for class prediction and a box loss.
• Auxiliary decoding losses in the decoder help the model output the correct number of objects of each class.
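As a hedged illustration of the matching step, here is a minimal sketch using scipy's Hungarian solver; the cost weight and the omission of the generalized-IoU term are simplifications of the full matching cost.

```python
# Sketch of bipartite matching between N predictions and M ground-truth objects.
# Assumes a classification cost plus an L1 box cost; the weight 5.0 and the
# missing generalized-IoU term are simplifying assumptions.
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """pred_logits: (N, C+1), pred_boxes: (N, 4) in (cx, cy, w, h),
    gt_labels: (M,), gt_boxes: (M, 4). Returns matched (pred_idx, gt_idx)."""
    prob = pred_logits.softmax(-1)                      # class probabilities
    cost_class = -prob[:, gt_labels]                    # (N, M): -p(class of GT j)
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)  # (N, M): L1 box distance
    cost = cost_class + 5.0 * cost_bbox                 # weighted total cost
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, gt_idx
```

The Hungarian loss is then the classification term over all N slots (with the "no object" class for unmatched slots) plus the box loss over the matched pairs.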
Methods and Experiments
Experiments – Dataset and Settings
• COCO 2017 detection and panoptic segmentation datasets
- 118k training images and 5k validation images
• Compared with Faster R-CNN using AP.
- DETR: ResNet-50 backbone
- DETR-R101: ResNet-101 backbone
- DETR-DC5: dilated convolution in the last backbone stage
- DETR-DC5-R101: dilated DETR-R101
• Trained the baseline model for 300 epochs on 16 V100 GPUs for 3 days, with 4 images per GPU.
Methods and Experiments
Experiments
• DETR with 6 encoder and 6 decoder layers of width 256 and 8 attention heads.
• DETR is competitive with Faster R-CNN with the same number of parameters.
• Improved performance on large objects, but still lagging on small objects
→ likely due to the processing of global information by self-attention.
Methods and Experiments
Experiments
• The encoder already seems to separate instances, simplifying object extraction and localization for the decoder.
Methods and Experiments
Experiments
• Since the encoder has already separated instances via global attention, the decoder only needs to attend to object extremities to extract the class and object boundaries.
Methods and Experiments
Experiments – DETR for Panoptic Segmentation
• A mask head is added on top of the decoder outputs; it predicts a binary mask for each of the predicted boxes.
• The mask head takes the transformer decoder output for each object and computes multi-head attention scores between these embeddings and the encoder output, generating attention heatmaps per object.
• For the final prediction, an FPN-like architecture is used.
• The mask head can be trained either jointly with DETR, or in a two-step process (train DETR for boxes only, then freeze all weights and train only the mask head for 25 epochs). A sketch of the heatmap step follows below.
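A hedged sketch of the per-object attention-heatmap step; the shapes and function names are assumptions, and the FPN-like upsampling path that turns heatmaps into full-resolution masks is omitted.

```python
# Per-object attention heatmaps: attention between decoder object embeddings
# and the flattened encoder memory (illustrative shapes; the real mask head
# also runs an FPN-style upsampling path on top of these maps).
import torch

def object_attention_maps(obj_embed, enc_memory, H, W, num_heads=8):
    """obj_embed: (B, N, d) decoder outputs; enc_memory: (B, H*W, d)."""
    B, N, d = obj_embed.shape
    dh = d // num_heads
    q = obj_embed.view(B, N, num_heads, dh)       # one query per object
    k = enc_memory.view(B, H * W, num_heads, dh)  # one key per image location
    attn = torch.einsum("bnhd,bphd->bnhp", q, k) / dh ** 0.5
    # softmax over spatial positions -> one heatmap per object and head
    return attn.softmax(-1).view(B, N, num_heads, H, W)
```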
Conclusion
• DETR is a new design for object detection based on transformers and a bipartite matching loss for direct set prediction.
• Achieved results comparable to an optimized Faster R-CNN baseline.
• DETR achieved significantly better performance on detecting large objects.
• DETR showed strength in segmenting stuff classes, owing to the global reasoning allowed by the encoder attention.
• Challenges remain in training, optimization, and performance on small objects.