A Unified Framework for Computer Vision Tasks:
(Conditional) Generative Model is All You Need
2022.10.17.
Sangwoo Mo
1
• Prior works designed a specific algorithm for each computer vision task
Motivation
2
Slide from Stanford CS231n
• Prior works designed a specific algorithm for each computer vision task
• Example of semantic segmentation algorithm
Motivation
3
Slide from Stanford CS231n
• Prior works designed a specific algorithm for each computer vision task
• Example of object detection algorithm
Motivation
4
Slide from Stanford CS231n
• Prior works designed a specific algorithm for each computer vision task
• Example of object detection algorithm
Motivation
5
Slide from Stanford CS231n
• Prior works designed a specific algorithm for each computer vision task
• Example of instance segmentation algorithm
Motivation
6
Slide from Stanford CS231n
• Prior works designed a specific algorithm for each computer vision task
• However, such a task-specific approach is not desirable
• Humans may not use different techniques to solve each of these vision tasks
• Designing a new algorithm for every new task (e.g., keypoint detection) is inefficient and impractical
• Goal. Build a single unified framework that can solve all (or most) computer vision tasks
• Prediction is just an X (input) to Y (output) mapping
• One can generally use a conditional generative model to predict an arbitrary Y
Motivation
7
• This talk follows the recent line of work by Ting Chen (1st author of SimCLR)
1. Tasks with sparse outputs (e.g., detection = object-wise bboxes)
• Idea: Use an autoregressive model to predict discrete tokens (e.g., sequence of bboxes)
• Pix2seq: A Language Modeling Framework for Object Detection (ICLR’22)
• A Unified Sequence Interface for Vision Tasks (NeurIPS’22)
2. Tasks with dense outputs (e.g., segmentation = pixel-wise labels)
• Idea: Use a diffusion model to predict continuous outputs (e.g., segmentation maps)
• A Generalist Framework for Panoptic Segmentation of Images and Videos (submitted to ICLR’23)
Outline
8
• This talk follows the recent line of work by Ting Chen (1st author of SimCLR)
Outline
9
• Pix2Seq
• Cast object descriptions as a sequence of discrete tokens (bboxes and class labels)
• Training and inference are done as in language modeling (MLE training, stochastic decoding)
• Each object = {4 bbox coordinates + 1 class label}
• Each coordinate is quantized into n_bins values, hence the vocab size = n_bins + n_classes + 1 (for the [EOS] token); see the tokenization sketch below
Tasks with sparse outputs
10
CNN encoder + Transformer decoder
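• A minimal Python sketch of such a tokenization (not the authors' code; the coordinate order, bin count, and class vocabulary below are assumptions for illustration):

# Pix2Seq-style tokenization sketch: each object becomes 5 discrete tokens.
N_BINS = 1000                         # quantization bins for coordinates (assumed)
CLASSES = ["person", "car", "dog"]    # hypothetical class vocabulary
CLASS_BASE = N_BINS                   # class tokens come after the coordinate tokens
EOS = N_BINS + len(CLASSES)           # vocab size = n_bins + n_classes + 1

def quantize(coord, size):
    """Map a continuous coordinate in [0, size] to a bin index in [0, N_BINS - 1]."""
    return min(int(coord / size * N_BINS), N_BINS - 1)

def object_to_tokens(bbox, label, img_w, img_h):
    """One object = 4 quantized bbox coordinates + 1 class token."""
    x0, y0, x1, y1 = bbox
    return [quantize(y0, img_h), quantize(x0, img_w),
            quantize(y1, img_h), quantize(x1, img_w),
            CLASS_BASE + CLASSES.index(label)]

def image_to_sequence(objects, img_w, img_h):
    """Flatten all objects into one target sequence, terminated by [EOS]."""
    seq = [tok for bbox, label in objects
           for tok in object_to_tokens(bbox, label, img_w, img_h)]
    return seq + [EOS]

# Two objects in a 640x480 image -> an 11-token target sequence.
print(image_to_sequence([((10, 20, 200, 220), "person"),
                         ((300, 50, 630, 400), "car")], 640, 480))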
• Pix2Seq
• Cast object descriptions as a sequence of discrete tokens (bboxes and class labels)
• Setting n_bins ≈ # of pixels is sufficient to detect small objects
Tasks with sparse outputs
11
• Pix2Seq
• Sequence augmentation to propose more regions and improve recall
• Pix2Seq misses some objects due to early termination of decoding ([EOS] is emitted too early)
• To avoid this, Pix2Seq pads the sequence to a fixed maximum number of bboxes by adding synthetic bboxes
• Specifically, sample the 4 coordinates of a random rectangle and assign it a "noise" class (sketched below)
Tasks with sparse outputs
12
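• A rough sketch of the augmentation idea, reusing the helpers from the tokenization sketch above (the "noise" token and the uniform sampling are simplified assumptions; the paper also derives synthetic boxes by perturbing real ones):

import random

NOISE = CLASS_BASE + len(CLASSES)  # hypothetical extra "noise" class token

def augmented_target_sequence(objects, img_w, img_h, max_objects=100):
    """Pad the ground-truth objects with random "noise" boxes up to a fixed count,
    so the decoder is trained to emit a long sequence instead of stopping early."""
    seq = [tok for bbox, label in objects
           for tok in object_to_tokens(bbox, label, img_w, img_h)]
    for _ in range(max_objects - len(objects)):
        x0, x1 = sorted(random.uniform(0, img_w) for _ in range(2))
        y0, y1 = sorted(random.uniform(0, img_h) for _ in range(2))
        seq += [quantize(y0, img_h), quantize(x0, img_w),
                quantize(y1, img_h), quantize(x1, img_w), NOISE]
    return seq  # fixed length: max_objects * 5 tokens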
• Pix2Seq
• Sequence augmentation to propose more regions and improve recall
• To avoid this, Pix2Seq pads the sequence to a fixed maximum number of bboxes by adding synthetic bboxes
• Then, the model decodes a fixed # of objects, replacing the "noise" class with the most likely real class (see the decoding sketch below)
• Sequence augmentation significantly improves the detection performance
• IMO, this trick could also be used in open-set scenarios (getting bboxes of unknown objects)
Tasks with sparse outputs
13
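• A hypothetical sketch of this decoding rule (assuming the per-slot logits over class tokens are available; not the paper's exact scoring):

import numpy as np

def decode_objects(class_logits, boxes, score_threshold=0.0):
    """class_logits: (max_objects, n_classes + 1) logits over [real classes..., "noise"];
    boxes: already-dequantized bbox coordinates for each decoded slot.
    Replace "noise" with the most likely real class and use its probability as the score."""
    results = []
    for logits, box in zip(class_logits, boxes):
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        real_probs = probs[:-1]        # drop the trailing "noise" class
        cls = int(real_probs.argmax())
        score = float(real_probs[cls])
        if score >= score_threshold:
            results.append((box, cls, score))
    return results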
• Pix2Seq
• Experimental results
• Pix2Seq is comparable to Faster R-CNN and DETR
• Pix2Seq scales with model size and image resolution
Tasks with sparse outputs
14
• Pix2Seq – Multi-task
• The idea of Pix2Seq can be applied to various problems
• A single model solves detection, segmentation, and captioning by controlling the input prompt
Tasks with sparse outputs
15
• Pix2Seq – Multi-task
• The idea of Pix2Seq can be applied to various problems
• Object detection → same as before
• Captioning → obvious
• Instance segmentation & keypoint detection
→ Condition on each object bbox
• Seg mask → predict polygon
• Keypoint → predict seq. of points
{4 coordinates + keypoint label}
• The paper lacks explanation, but I guess one needs a two-stage approach for instance segmentation (get bboxes first, then predict the mask by conditioning on each bbox); a rough sketch of the prompt interface follows below
Tasks with sparse outputs
16
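• As a rough sketch of such a prompt-controlled interface (the prompt tokens and output formats below are illustrative assumptions, not the paper's exact vocabulary):

def build_prompt(task, bbox_tokens=None):
    """A single sequence model is steered by a task prompt prepended to the decoder input;
    box-conditioned tasks also put the (quantized) object bbox into the prompt."""
    prompts = {
        "detect":   ["[detect]"],                              # output: bbox + class per object
        "caption":  ["[caption]"],                             # output: text tokens
        "segment":  ["[segment]"] + list(bbox_tokens or []),   # output: polygon vertices
        "keypoint": ["[keypoint]"] + list(bbox_tokens or []),  # output: keypoint sequence
    }
    return prompts[task]

# e.g., instance segmentation conditioned on one detected box (two-stage usage):
print(build_prompt("segment", bbox_tokens=[104, 31, 458, 625]))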
• Pix2Seq – Multi-task
• Experimental results
• This unified framework works for various problems
Tasks with sparse outputs
17
• Pix2Seq-𝒟 (dense)
• Autoregressive Transformers can predict sparse outputs, but are not suitable for dense outputs (e.g., pixel-wise segmentation)
• Instead, one can use a diffusion model to generate the mask conditioned on the image
Tasks with dense outputs
18
• Pix2Seq-𝒟 (dense)
• Instead, one can use a diffusion model to generate the mask conditioned on the image
• Condition on the image and the previous mask to predict the next mask (see the denoiser sketch below)
Tasks with dense outputs
19
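• A minimal sketch of such a conditional denoiser (an assumption about the architecture, not the paper's model; a real network would also condition on the diffusion timestep):

import torch
import torch.nn as nn

class MaskDenoiser(nn.Module):
    """Predict the clean mask representation from [noisy mask, image, previous mask],
    conditioned by simple channel-wise concatenation."""
    def __init__(self, mask_channels=7, img_channels=3, hidden=64):
        super().__init__()
        in_ch = mask_channels + img_channels + mask_channels
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, mask_channels, 3, padding=1),
        )

    def forward(self, noisy_mask, image, prev_mask):
        return self.net(torch.cat([noisy_mask, image, prev_mask], dim=1))

# e.g., 64x64 masks represented with 7 continuous channels (see the next slide for why)
out = MaskDenoiser()(torch.randn(2, 7, 64, 64),
                     torch.randn(2, 3, 64, 64),
                     torch.randn(2, 7, 64, 64))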
• Pix2Seq-𝒟 (dense)
• Instead, one can use a diffusion model to generate the mask conditioned on the image
• However, segmentation masks are discrete values (pixel-wise classification), so how do we define the diffusion process?
• The authors use Bit Diffusion, which converts the discrete values into binary bits and applies continuous diffusion to them (sketched below)
Tasks with dense outputs
20
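• A small sketch of the bit conversion plus standard continuous noising (assumptions: NumPy, LSB-first bit order; the actual Bit Diffusion recipe also uses tricks such as self-conditioning and input scaling):

import numpy as np

def int2bits(labels, n_bits):
    """Integer labels (H, W) -> "analog bits" in {-1, +1} of shape (H, W, n_bits)."""
    shifts = np.arange(n_bits)
    bits = (labels[..., None] >> shifts) & 1       # binary expansion, LSB first
    return bits.astype(np.float32) * 2.0 - 1.0     # {0, 1} -> {-1, +1}

def bits2int(analog_bits):
    """Threshold continuous model outputs back to integer labels."""
    bits = (analog_bits > 0).astype(np.int64)
    shifts = np.arange(bits.shape[-1])
    return (bits << shifts).sum(axis=-1)

def forward_diffuse(x0, t, alphas_cumprod):
    """Standard continuous forward noising q(x_t | x_0) applied to the analog bits."""
    noise = np.random.randn(*x0.shape).astype(np.float32)
    return np.sqrt(alphas_cumprod[t]) * x0 + np.sqrt(1 - alphas_cumprod[t]) * noise, noise

labels = np.random.randint(0, 128, size=(64, 64))  # e.g., per-pixel class / instance IDs
bits = int2bits(labels, n_bits=7)                  # 7 bits cover 128 discrete values
assert (bits2int(bits) == labels).all()            # round trip is lossless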
• Pix2Seq-𝒟 (dense)
• Experimental results
• Works, but worse than task-specific models such as Mask DINO
Tasks with dense outputs
21
• TL;DR. Simple autoregressive or diffusion models can solve a large class of computer vision problems
• Discussion. General vs. task-specific algorithm design
• Currently, task-specific algorithms usually perform better by leveraging the structure of the task
• However, a general-purpose algorithm may implicitly learn the structure of the task from data
• E.g., ViT learns the spatial structure of images, such as translation equivariance
• I believe the model should reflect the task structures in some way, either explicitly or implicitly
• In this perspective, I think there are three directions for designing algorithms:
1. Keep designing task-specific algorithms (a short-term goal before AGI arrives)
2. Make the general-purpose model better learn the task structure (e.g., sequence augmentation)
3. Analyze the structure learned by the general-purpose model (e.g., [1])
Discussion
22
[1] The Lie Derivative for Measuring Learned Equivariance → Analyze the equivariance learned by ViT
Thank you for listening! 😀
23
