End-to-End Object Detection with Transformers
Hwang Seung Hyun
Yonsei University Severance Hospital CCIDS
Facebook AI | MICCAI 2020
2020.10.11
Contents
01 Introduction
02 Related Work
03 Methods and Experiments
04 Conclusion
Yonsei University Severance Hospital CCIDS
DETR
Introduction – Background
• Object detection as set prediction – predicting a set of bounding boxes and class labels.
• Modern detectors address this indirectly – surrogate regression/classification tasks, anchors, non-maximum suppression (NMS), etc.
• Such methods are significantly influenced by these post-processing steps.
Introduction – Proposal
• Proposes a direct set prediction approach to bypass the surrogate tasks.
• Adopts an encoder-decoder architecture based on transformers.
• The self-attention mechanism of transformers, which models all pairwise interactions between elements in a sequence, helps remove duplicate predictions.
• DEtection TRansformer (DETR) predicts all objects at once in an end-to-end manner, with a set loss function that performs bipartite matching between predicted and ground-truth (GT) objects.
• DETR is the conjunction of a bipartite matching loss and transformers with parallel decoding.
[Figure: Overview of the proposed DETR framework]
Introduction – Contribution
• DETR simplifies the detection pipeline by dropping multiple hand-designed components that encode prior knowledge, such as spatial anchors and NMS.
• DETR doesn't require any customized layers and can be reproduced easily in any framework that provides standard CNN and transformer classes.
• DETR extends easily to more complex tasks such as panoptic segmentation.
• DETR demonstrates accuracy and run-time performance on par with the Faster R-CNN baseline on the COCO object detection dataset.
Related Work
Set Prediction
• There is no canonical deep learning model that directly predicts sets.
• Near-duplicates must be avoided → most detectors rely on post-processing such as NMS.
• Direct set prediction needs global inference schemes that model interactions between all predicted elements to avoid redundancy.
• For constant-size set prediction, FCNs or RNNs are typically used.
Related Work
Transformers and Parallel Decoding
• The transformer is an attention-based building block originally introduced for machine translation.
• Attention mechanisms aggregate information from the entire input sequence → self-attention layers.
• Parallel sequence generation was developed in the domains of audio, machine translation, and speech recognition.
Related Work
Object Detection
• Two-stage detectors predict boxes w.r.t. proposals.
• Single-stage methods make predictions w.r.t. anchors or a grid of possible object centers.
• DETR removes this hand-crafted process by directly predicting the set of detections with absolute box predictions w.r.t. the input image.
[Figures: YOLO, R-CNN]
Methods and Experiments
Proposed Framework
• Backbone: a conventional CNN backbone extracts a compact feature representation.
• Transformer Encoder: a 1x1 conv reduces the channel dimension → the feature map is flattened and positional encodings are added to the input of each attention layer → each encoder layer consists of a multi-head self-attention module and an FFN.
• Transformer Decoder: transforms N input embeddings (learnt positional encodings, i.e. object queries) using multi-headed self- and encoder-decoder attention mechanisms. Decodes the N objects in parallel at each decoder layer. Enables global reasoning about all objects using pair-wise relations between them.
• Feed-Forward Network (FFN): a 3-layer perceptron with ReLU activations and a linear projection layer. It predicts the normalized center coordinates, height, and width of the box, and the class label via a softmax function. (A minimal code sketch of this pipeline follows below.)
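Below is a minimal, hedged PyTorch sketch of this pipeline, in the spirit of the simplified DETR snippet. The learnt 2D positional encoding, the query count, and the layer widths are illustrative assumptions, not the exact training configuration.

```python
# Minimal sketch of the DETR pipeline described above (illustrative only:
# the positional-encoding scheme and hyperparameters are assumptions).
import torch
from torch import nn
from torchvision.models import resnet50

class SimpleDETR(nn.Module):
    def __init__(self, num_classes, hidden_dim=256, nheads=8,
                 num_enc=6, num_dec=6, num_queries=100):
        super().__init__()
        # CNN backbone: ResNet-50 without avgpool/fc; output has 2048 channels
        self.backbone = nn.Sequential(*list(resnet50(weights=None).children())[:-2])
        # 1x1 conv reduces the channel dimension to the transformer width
        self.conv = nn.Conv2d(2048, hidden_dim, 1)
        # standard transformer encoder-decoder
        self.transformer = nn.Transformer(hidden_dim, nheads, num_enc, num_dec)
        # prediction heads: class logits (+1 for "no object") and box (cx, cy, w, h)
        self.linear_class = nn.Linear(hidden_dim, num_classes + 1)
        self.linear_bbox = nn.Linear(hidden_dim, 4)
        # N learnt object queries and simple learnt 2D positional encodings
        self.query_pos = nn.Parameter(torch.rand(num_queries, hidden_dim))
        self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
        self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))

    def forward(self, x):
        h = self.conv(self.backbone(x))            # (B, hidden_dim, H, W)
        B, _, H, W = h.shape
        # build an (H*W, 1, hidden_dim) positional encoding and add it to the
        # flattened feature map before feeding the encoder
        pos = torch.cat([
            self.col_embed[:W].unsqueeze(0).repeat(H, 1, 1),
            self.row_embed[:H].unsqueeze(1).repeat(1, W, 1),
        ], dim=-1).flatten(0, 1).unsqueeze(1)
        src = pos + h.flatten(2).permute(2, 0, 1)  # (H*W, B, hidden_dim)
        tgt = self.query_pos.unsqueeze(1).repeat(1, B, 1)
        out = self.transformer(src, tgt)           # decode N objects in parallel
        return self.linear_class(out), self.linear_bbox(out).sigmoid()
```

Each forward pass returns a fixed-size set of N class-logit vectors and N normalized boxes per image, which the set prediction loss (next slides) then matches against the ground truth.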
Methods and Experiments
Set Prediction Loss
• DETR infers a fixed-size set of N predictions in a single pass through the decoder.
• The loss must score predicted objects against the ground truth – class, position, and size.
• The loss first finds an optimal bipartite matching between predicted and ground-truth objects (a minimal matching sketch follows this list), and then optimizes object-specific (bounding box) losses:

  $\hat{\sigma} = \arg\min_{\sigma \in \mathfrak{S}_N} \sum_{i=1}^{N} \mathcal{L}_{\mathrm{match}}(y_i, \hat{y}_{\sigma(i)})$,
  $\mathcal{L}_{\mathrm{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N} \left[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}}\, \mathcal{L}_{\mathrm{box}}(b_i, \hat{b}_{\hat{\sigma}(i)}) \right]$

• The Hungarian loss, computed after each decoder layer for all matched pairs, is a linear combination of the negative log-likelihood for class prediction and a box loss.
• Auxiliary decoding losses in the decoder help the model output the correct number of objects of each class.
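As a hedged illustration of the matching step, here is a minimal sketch using scipy's Hungarian solver; the cost weight and the omission of the generalized-IoU term are simplifications of the full matching cost.

```python
# Sketch of bipartite matching between N predictions and M ground-truth objects.
# Assumes a classification cost plus an L1 box cost; the weight 5.0 and the
# missing generalized-IoU term are simplifying assumptions.
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """pred_logits: (N, C+1), pred_boxes: (N, 4) in (cx, cy, w, h),
    gt_labels: (M,), gt_boxes: (M, 4). Returns matched (pred_idx, gt_idx)."""
    prob = pred_logits.softmax(-1)                      # class probabilities
    cost_class = -prob[:, gt_labels]                    # (N, M): -p(class of GT j)
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)  # (N, M): L1 box distance
    cost = cost_class + 5.0 * cost_bbox                 # weighted total cost
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return pred_idx, gt_idx
```

The Hungarian loss is then the classification term over all N slots (with the "no object" class for unmatched slots) plus the box loss over the matched pairs.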
Methods and Experiments
Experiments – Dataset and Settings
• COCO 2017 detection and panoptic segmentation datasets
- 118k training images and 5k validation images
• Compared with Faster R-CNN using AP.
- DETR: ResNet-50 backbone
- DETR-R101: ResNet-101 backbone
- DETR-DC5: dilated convolution in the last backbone stage
- DETR-DC5-R101: dilated DETR-R101
• Trained the baseline model for 300 epochs on 16 V100 GPUs for 3 days, with 4 images per GPU.
Methods and Experiments
Experiments
• DETR with 6 encoder and 6 decoder layers of width 256 and 8 attention heads.
• DETR is competitive with Faster R-CNN with the same number of parameters.
• Improved performance on large objects, but still lagging on small objects
→ likely due to the processing of global information by self-attention.
Methods and Experiments
Experiments
• The encoder already seems to separate instances, simplifying object extraction and localization for the decoder.
Methods and Experiments
Experiments
• Since the encoder has already separated instances via global attention, the decoder only needs to attend to object extremities to extract the class and object boundaries.
Methods and Experiments
Experiments – DETR for Panoptic Segmentation
• A mask head is added on top of the decoder outputs; it predicts a binary mask for each of the predicted boxes.
• The mask head takes the transformer decoder output for each object and computes multi-head attention scores between these embeddings and the encoder output, generating attention heatmaps per object.
• For the final prediction, an FPN-like architecture is used.
• The mask head can be trained either jointly with DETR, or in a two-step process (train DETR for boxes only, then freeze all weights and train only the mask head for 25 epochs). A sketch of the heatmap step follows below.
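A hedged sketch of the per-object attention-heatmap step; the shapes and function names are assumptions, and the FPN-like upsampling path that turns heatmaps into full-resolution masks is omitted.

```python
# Per-object attention heatmaps: attention between decoder object embeddings
# and the flattened encoder memory (illustrative shapes; the real mask head
# also runs an FPN-style upsampling path on top of these maps).
import torch

def object_attention_maps(obj_embed, enc_memory, H, W, num_heads=8):
    """obj_embed: (B, N, d) decoder outputs; enc_memory: (B, H*W, d)."""
    B, N, d = obj_embed.shape
    dh = d // num_heads
    q = obj_embed.view(B, N, num_heads, dh)       # one query per object
    k = enc_memory.view(B, H * W, num_heads, dh)  # one key per image location
    attn = torch.einsum("bnhd,bphd->bnhp", q, k) / dh ** 0.5
    # softmax over spatial positions -> one heatmap per object and head
    return attn.softmax(-1).view(B, N, num_heads, H, W)
```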
Conclusion
• DETR is a new design for object detection based on transformers and a bipartite matching loss for direct set prediction.
• Achieved results comparable to an optimized Faster R-CNN baseline.
• DETR achieved significantly better performance on detecting large objects.
• DETR showed strength in segmenting stuff classes, owing to the global reasoning allowed by the encoder attention.
• Challenges remain in training, optimization, and performance on small objects.