A Unified Framework for Computer Vision Tasks:
(Conditional) Generative Model is All You Need
2022.10.17.
Sangwoo Mo
1
• Prior works designed a specific algorithm for each computer vision task
Motivation
2
Slide from Stanford CS231n
• Prior works designed a specific algorithm for each computer vision task
• Example of semantic segmentation algorithm
Motivation
3
Slide from Stanford CS231n
• Prior works designed a specific algorithm for each computer vision task
• Example of object detection algorithm
Motivation
4
Slide from Stanford CS231n
• Prior works designed a specific algorithm for each computer vision task
• Example of object detection algorithm
Motivation
5
Slide from Stanford CS231n
• Prior works designed a specific algorithm for each computer vision task
• Example of instance segmentation algorithm
Motivation
6
Slide from Stanford CS231n
• Prior works designed a specific algorithm for each computer vision task
• However, such a task-specific approach is not desirable
• Humans may not use different techniques to solve each of these vision tasks
• Designing a new algorithm for every new task (e.g., keypoint detection) is inefficient and impractical
• Goal. Build a single unified framework that can solve all (or most) computer vision tasks
• Prediction is just an X (input) to Y (output) mapping
• One can generally use a conditional generative model to predict an arbitrary Y
Motivation
7
• This talk follows the recent line of work by Ting Chen (1st author of SimCLR)
1. Tasks with sparse outputs (e.g., detection = object-wise bboxes)
• Idea: Use an autoregressive model to predict discrete tokens (e.g., sequence of bboxes)
• Pix2seq: A Language Modeling Framework for Object Detection (ICLR’22)
• A Unified Sequence Interface for Vision Tasks (NeurIPS’22)
2. Tasks with dense outputs (e.g., segmentation = pixel-wise labels)
• Idea: Use a diffusion model to predict continuous outputs (e.g., segmentation maps)
• A Generalist Framework for Panoptic Segmentation of Images and Videos (submitted to ICLR’23)
Outline
8
• This talk follows the recent line of work by Ting Chen (1st author of SimCLR)
Outline
9
• Pix2Seq
• Cast object descriptions as a sequence of discrete tokens (bboxes and class labels)
• Training and inference are done as in language modeling (MLE training, stochastic decoding)
• Each object = {4 bbox coordinates + 1 class label}
• Each coordinate is quantized into n_bins values, hence the vocab size = n_bins + n_classes + 1 (for the [EOS] token); see the tokenization sketch below
Tasks with sparse outputs
10
CNN encoder + Transformer decoder
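• A minimal Python sketch of such a tokenization (not the authors' code; the coordinate order, bin count, and class vocabulary below are assumptions for illustration):

# Pix2Seq-style tokenization sketch: each object becomes 5 discrete tokens.
N_BINS = 1000                         # quantization bins for coordinates (assumed)
CLASSES = ["person", "car", "dog"]    # hypothetical class vocabulary
CLASS_BASE = N_BINS                   # class tokens come after the coordinate tokens
EOS = N_BINS + len(CLASSES)           # vocab size = n_bins + n_classes + 1

def quantize(coord, size):
    """Map a continuous coordinate in [0, size] to a bin index in [0, N_BINS - 1]."""
    return min(int(coord / size * N_BINS), N_BINS - 1)

def object_to_tokens(bbox, label, img_w, img_h):
    """One object = 4 quantized bbox coordinates + 1 class token."""
    x0, y0, x1, y1 = bbox
    return [quantize(y0, img_h), quantize(x0, img_w),
            quantize(y1, img_h), quantize(x1, img_w),
            CLASS_BASE + CLASSES.index(label)]

def image_to_sequence(objects, img_w, img_h):
    """Flatten all objects into one target sequence, terminated by [EOS]."""
    seq = [tok for bbox, label in objects
           for tok in object_to_tokens(bbox, label, img_w, img_h)]
    return seq + [EOS]

# Two objects in a 640x480 image -> an 11-token target sequence.
print(image_to_sequence([((10, 20, 200, 220), "person"),
                         ((300, 50, 630, 400), "car")], 640, 480))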
• Pix2Seq
• Cast object descriptions as a sequence of discrete tokens (bboxes and class labels)
• Setting n_bins ≈ # of pixels is sufficient to detect small objects
Tasks with sparse outputs
11
• Pix2Seq
• Sequence augmentation to propose more regions and improve recall
• Pix2Seq misses some objects due to early termination of decoding ([EOS] is emitted too early)
• To avoid this, Pix2Seq pads the sequence to a fixed maximum number of bboxes by adding synthetic bboxes
• Specifically, sample the 4 coordinates of a random rectangle and assign it a "noise" class (sketched below)
Tasks with sparse outputs
12
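• A rough sketch of the augmentation idea, reusing the helpers from the tokenization sketch above (the "noise" token and the uniform sampling are simplified assumptions; the paper also derives synthetic boxes by perturbing real ones):

import random

NOISE = CLASS_BASE + len(CLASSES)  # hypothetical extra "noise" class token

def augmented_target_sequence(objects, img_w, img_h, max_objects=100):
    """Pad the ground-truth objects with random "noise" boxes up to a fixed count,
    so the decoder is trained to emit a long sequence instead of stopping early."""
    seq = [tok for bbox, label in objects
           for tok in object_to_tokens(bbox, label, img_w, img_h)]
    for _ in range(max_objects - len(objects)):
        x0, x1 = sorted(random.uniform(0, img_w) for _ in range(2))
        y0, y1 = sorted(random.uniform(0, img_h) for _ in range(2))
        seq += [quantize(y0, img_h), quantize(x0, img_w),
                quantize(y1, img_h), quantize(x1, img_w), NOISE]
    return seq  # fixed length: max_objects * 5 tokens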
• Pix2Seq
• Sequence augmentation to propose more regions and improve recall
• To avoid this, Pix2Seq pads the sequence to a fixed maximum number of bboxes by adding synthetic bboxes
• Then, the model decodes a fixed # of objects, replacing the "noise" class with the most likely real class (see the decoding sketch below)
• Sequence augmentation significantly improves the detection performance
• IMO, this trick could also be used in open-set scenarios (getting bboxes of unknown objects)
Tasks with sparse outputs
13
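• A hypothetical sketch of this decoding rule (assuming the per-slot logits over class tokens are available; not the paper's exact scoring):

import numpy as np

def decode_objects(class_logits, boxes, score_threshold=0.0):
    """class_logits: (max_objects, n_classes + 1) logits over [real classes..., "noise"];
    boxes: already-dequantized bbox coordinates for each decoded slot.
    Replace "noise" with the most likely real class and use its probability as the score."""
    results = []
    for logits, box in zip(class_logits, boxes):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        real_probs = probs[:-1]        # drop the trailing "noise" class
        cls = int(real_probs.argmax())
        score = float(real_probs[cls])
        if score >= score_threshold:
            results.append((box, cls, score))
    return results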
• Pix2Seq
• Experimental results
• Pix2Seq is comparable to Faster R-CNN and DETR
• Pix2Seq scales with model size and image resolution
Tasks with sparse outputs
14
• Pix2Seq – Multi-task
• The idea of Pix2Seq can be applied to various problems
• A single model solves detection, segmentation, and captioning by controlling the input prompt
Tasks with sparse outputs
15
• Pix2Seq – Multi-task
• The idea of Pix2Seq can be applied to various problems
• Object detection → same as before
• Captioning → obvious
• Instance segmentation & keypoint detection
→ Condition on each object bbox
• Seg mask → predict polygon
• Keypoint → predict seq. of points
{4 coordinates + keypoint label}
• The paper lacks explanation, but I guess one needs a two-stage approach for instance segmentation (get bboxes first, then predict the mask by conditioning on each bbox); a rough sketch of the prompt interface follows below
Tasks with sparse outputs
16
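• As a rough sketch of such a prompt-controlled interface (the prompt tokens and output formats below are illustrative assumptions, not the paper's exact vocabulary):

def build_prompt(task, bbox_tokens=None):
    """A single sequence model is steered by a task prompt prepended to the decoder input;
    box-conditioned tasks also put the (quantized) object bbox into the prompt."""
    prompts = {
        "detect":   ["[detect]"],                              # output: bbox + class per object
        "caption":  ["[caption]"],                             # output: text tokens
        "segment":  ["[segment]"] + list(bbox_tokens or []),   # output: polygon vertices
        "keypoint": ["[keypoint]"] + list(bbox_tokens or []),  # output: keypoint sequence
    }
    return prompts[task]

# e.g., instance segmentation conditioned on one detected box (two-stage usage):
print(build_prompt("segment", bbox_tokens=[104, 31, 458, 625]))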
• Pix2Seq – Multi-task
• Experimental results
• This unified framework works for various problems
Tasks with sparse outputs
17
• Pix2Seq-𝒟 (dense)
• Autoregressive Transformers can predict sparse outputs, but are not suitable for dense outputs (e.g., pixel-wise segmentation)
• Instead, one can use a diffusion model to generate the mask conditioned on the image
Tasks with dense outputs
18
• Pix2Seq-𝒟 (dense)
• Instead, one can use a diffusion model to generate the mask conditioned on the image
• Condition on the image and the previous mask to predict the next mask (see the denoiser sketch below)
Tasks with dense outputs
19
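• A minimal sketch of such a conditional denoiser (an assumption about the architecture, not the paper's model; a real network would also condition on the diffusion timestep):

import torch
import torch.nn as nn

class MaskDenoiser(nn.Module):
    """Predict the clean mask representation from [noisy mask, image, previous mask],
    conditioned by simple channel-wise concatenation."""
    def __init__(self, mask_channels=7, img_channels=3, hidden=64):
        super().__init__()
        in_ch = mask_channels + img_channels + mask_channels
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, mask_channels, 3, padding=1),
        )

    def forward(self, noisy_mask, image, prev_mask):
        return self.net(torch.cat([noisy_mask, image, prev_mask], dim=1))

# e.g., 64x64 masks represented with 7 continuous channels (see the next slide for why)
out = MaskDenoiser()(torch.randn(2, 7, 64, 64),
                     torch.randn(2, 3, 64, 64),
                     torch.randn(2, 7, 64, 64))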
• Pix2Seq-𝒟 (dense)
• Instead, one can use a diffusion model to generate the mask conditioned on the image
• However, segmentation masks are discrete values (pixel-wise classification), so how do we define the diffusion process?
• The authors use Bit Diffusion, which converts the discrete values into binary bits and applies continuous diffusion to them (sketched below)
Tasks with dense outputs
20
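• A small sketch of the bit conversion plus standard continuous noising (assumptions: NumPy, LSB-first bit order; the actual Bit Diffusion recipe also uses tricks such as self-conditioning and input scaling):

import numpy as np

def int2bits(labels, n_bits):
    """Integer labels (H, W) -> "analog bits" in {-1, +1} of shape (H, W, n_bits)."""
    shifts = np.arange(n_bits)
    bits = (labels[..., None] >> shifts) & 1       # binary expansion, LSB first
    return bits.astype(np.float32) * 2.0 - 1.0     # {0, 1} -> {-1, +1}

def bits2int(analog_bits):
    """Threshold continuous model outputs back to integer labels."""
    bits = (analog_bits > 0).astype(np.int64)
    shifts = np.arange(bits.shape[-1])
    return (bits << shifts).sum(axis=-1)

def forward_diffuse(x0, t, alphas_cumprod):
    """Standard continuous forward noising q(x_t | x_0) applied to the analog bits."""
    noise = np.random.randn(*x0.shape).astype(np.float32)
    return np.sqrt(alphas_cumprod[t]) * x0 + np.sqrt(1 - alphas_cumprod[t]) * noise, noise

labels = np.random.randint(0, 128, size=(64, 64))  # e.g., per-pixel class / instance IDs
bits = int2bits(labels, n_bits=7)                  # 7 bits cover 128 discrete values
assert (bits2int(bits) == labels).all()            # round trip is lossless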
• Pix2Seq-𝒟 (dense)
• Experimental results
• Works, but worse than task-specific models such as Mask DINO
Tasks with dense outputs
21
• TL;DR. Simple autoregressive or diffusion models can solve a large class of computer vision problems
• Discussion. General vs. task-specific algorithm design
• Currently, task-specific algorithms usually perform better by leveraging the structure of the task
• However, a general-purpose algorithm may implicitly learn the structure of the task from data
• E.g., ViT learns the spatial structure of images, such as translation equivariance
• I believe the model should reflect the task structures in some way, either explicitly or implicitly
• In this perspective, I think there are three directions for designing algorithms:
1. Keep designing task-specific algorithms (a short-term goal before AGI arrives)
2. Make the general-purpose model better learn the task structure (e.g., sequence augmentation)
3. Analyze the structure learned by the general-purpose model (e.g., [1])
Discussion
22
[1] The Lie Derivative for Measuring Learned Equivariance → Analyze the equivariance learned by ViT
Thank you for listening! 😀
23
