Zero-shot learning capabilities
of CLIP model from OpenAI
Yurii Pashchenko AI&BigData Online Day 2021
Yurii Pashchenko
Sr ML Engineer at Depositphotos
About me
❏ Yurii Pashchenko
❏ Sr Machine Learning Engineer at Depositphotos
❏ Over 8 years of research and commercial experience in
applying Deep Learning models
❏ Object Detection Specialist
❏ Knowledge Sharing Master at Transformer*
* at least I want to become one
Zero-shot learning capabilities of CLIP model
from OpenAI
❏ Short intro to Zero-Shot Learning and CLIP from OpenAI
❏ Zero-Shot Classification based on CLIP
❏ CLIP for image ranking & search
❏ Limitations of CLIP model
❏ Object Detection/Segmentation
❏ Knowledge distillation
❏ GANs + CLIP
What is Zero-Shot Learning
Understanding Zero-Shot Learning — Making ML More Human
Motivation of CLIP from OpenAI?
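● Costly datasets: labeling is labor-intensive
● Narrow: a model trained for one task does not adapt to new tasks out of the box
● Poor real-world performance: benchmark results often do not carry over to real-world data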
CLIP: Connecting Text and Images
CLIP: Contrastive Language-Image
Pre-training
Learning Transferable Visual Models From Natural Language Supervision
● 400 million (image, text) pairs collected
from Internet.
● Trained modifications of ResNet-50
and ViT-B
● Batch size 32 768 for 32 epochs
● The largest ResNet model, RN50x64,
took 18 days to train on 592 V100
GPUs while the largest Vision
Transformer took 12 days on 256
V100 GPUs
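The contrastive objective behind this pre-training can be sketched in a few lines. This is a minimal pure-Python illustration, not OpenAI's training code: the toy 2-D embeddings are made-up values, and the temperature is fixed at 0.07 here as an assumption (in CLIP it is a learned parameter).

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy(logits, target):
    """Negative log-softmax of the target entry, computed stably."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric cross-entropy over the image-text similarity matrix.

    Row k of image_embs and row k of text_embs are assumed to be a matching pair.
    """
    images = [normalize(v) for v in image_embs]
    texts = [normalize(v) for v in text_embs]
    n = len(images)
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
              for img in images]
    # Image -> text: each row should pick out its own caption.
    loss_images = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # Text -> image: each column should pick out its own image.
    loss_texts = sum(cross_entropy([row[k] for row in logits], k) for k in range(n)) / n
    return (loss_images + loss_texts) / 2


# Toy 2-D embeddings (hypothetical values, not real CLIP outputs).
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.1], [0.1, 1.0]])
swapped = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 1.0], [1.0, 0.1]])
print(aligned < swapped)
```

The loss is small when each image is closest to its own caption and large when the captions are swapped, which is what pushes matching pairs together in the shared embedding space.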
CLIP for Zero-Shot Classification
Learning Transferable Visual Models From Natural Language Supervision
Prompt engineering and ensembling
around 80 prompts together improve
ImageNet accuracy by
almost 5%
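A minimal sketch of zero-shot classification with prompt ensembling, assuming the image and prompt embeddings have already been produced by CLIP's encoders. The 3-D vectors below are hypothetical stand-ins, and the logit scale of 100 is an assumption rather than a value read from a specific checkpoint.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ensemble_text_embedding(prompt_embs):
    """Average the unit-length embeddings of several prompts for one class."""
    unit = [normalize(e) for e in prompt_embs]
    mean = [sum(column) / len(unit) for column in zip(*unit)]
    return normalize(mean)


def zero_shot_classify(image_emb, class_prompt_embs, logit_scale=100.0):
    """Softmax over cosine similarities between the image and each class embedding."""
    image = normalize(image_emb)
    names = list(class_prompt_embs)
    logits = [logit_scale * dot(image, ensemble_text_embedding(class_prompt_embs[n]))
              for n in names]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return {name: e / total for name, e in zip(names, exps)}


# Hypothetical embeddings: two prompts per class, e.g. "a photo of a dog"
# and "a drawing of a dog" (values invented for illustration).
class_prompts = {
    "dog": [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]],
    "cat": [[0.1, 0.9, 0.0], [0.2, 0.8, 0.1]],
}
probs = zero_shot_classify([0.85, 0.2, 0.05], class_prompts)
print(max(probs, key=probs.get))
```

No classifier weights are trained: the class "weights" are just the averaged text embeddings, so adding a class means adding prompts.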
CLIP Zero-Shot visual results
CLIP: Connecting Text and Images
CLIP Zero-Shot generalization
Learning Transferable Visual Models From Natural Language Supervision
CLIP Zero-Shot vs Few-Shot
Learning Transferable Visual Models From Natural Language Supervision
CLIP on FairFace
FairFace: Face Attribute Dataset for Balanced Race, Gender, and Age for Bias Measurement and
Mitigation
CLIP has a top-1 accuracy of 59.2% for “in the
wild” celebrity image classification when
choosing from 100 candidates and a top-1
accuracy of 43.3% when choosing from 1000
possible choices
CLIP for Image Ranking
DALL·E: Creating Images from Text
“an armchair in the shape of an avocado”
“a living room with two white armchairs and a painting of the
colosseum. the painting is mounted above a modern fireplace”
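Ranking generated samples against a prompt reduces to sorting by cosine similarity. A minimal sketch, assuming text and image embeddings are already available; the sample names and 2-D vectors are hypothetical.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(text_emb, image_embs, top_k=2):
    """Return the top_k (name, score) pairs, best match first."""
    query = normalize(text_emb)
    scored = [(name, dot(query, normalize(emb))) for name, emb in image_embs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical embeddings for three generated samples.
samples = {
    "sample_a": [0.7, 0.6],
    "sample_b": [1.0, 0.1],
    "sample_c": [0.0, 1.0],
}
ranked = rank_by_similarity([0.7, 0.7], samples)
print([name for name, _ in ranked])
```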
CLIP for Image Search
Text-to-Image
Unsplash Image Search
CLIP for Image Search
Image-to-Image
Unsplash Image Search
CLIP for Image Search
Text+Text-to-Image
Unsplash Image Search
CLIP for Image Search
Image+Text-to-Image
Unsplash Image Search
Query: image + “cars”
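Because text and images live in the same embedding space, combined queries can be sketched as a weighted sum of unit-length embeddings. This is one plausible way to do it, not necessarily how the Unsplash demo is implemented; the photo names, vectors, and the `combine_queries` helper are hypothetical.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def combine_queries(query_embs, weights=None):
    """Weighted sum of unit-length query embeddings, re-normalized."""
    if weights is None:
        weights = [1.0] * len(query_embs)
    unit = [normalize(e) for e in query_embs]
    mixed = [sum(w * x for w, x in zip(weights, column)) for column in zip(*unit)]
    return normalize(mixed)


def search(query_emb, photo_embs):
    """Name of the photo whose embedding is closest to the query."""
    return max(photo_embs, key=lambda name: dot(query_emb, normalize(photo_embs[name])))


# Hypothetical 3-D embeddings: axes loosely mean (beach, car, forest).
photos = {
    "beach": [1.0, 0.0, 0.0],
    "beach_with_car": [0.7, 0.7, 0.0],
    "car": [0.0, 1.0, 0.0],
}
image_query = [1.0, 0.0, 0.0]   # embedding of a query photo
text_query = [0.0, 1.0, 0.0]    # embedding of the text "cars"

image_only = search(normalize(image_query), photos)
image_plus_text = search(combine_queries([image_query, text_query]), photos)
print(image_only, image_plus_text)
```

Raising the weight of one query shifts the results toward it, which is a cheap way to steer a search without retraining anything.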
CLIP limitations
Learning Transferable Visual Models From Natural Language Supervision
● poor generalization to images not covered
in its pre-training dataset (MNIST)
CLIP limitations
Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers
Text: “an elephant”, “a zebra”, “a lake”
examples from this Colab
CLIP limitations
Learning Transferable Visual Models From Natural Language Supervision
● poor generalization to images not covered
in its pre-training dataset (MNIST)
● counting the number of objects in an image
● predicting how close the nearest object is in
a photo
● CLIP’s zero-shot classifiers can be sensitive
to wording or phrasing and sometimes
require trial and error “prompt engineering”
to perform well.
You can’t just make an Object Detector
from a Classifier
… without fine-tuning
Assembling Object Detector with CLIP
Rich feature hierarchies for accurate object detection and semantic segmentation
Diagram labels: CLIP, Text Encoder; text prompt: “person”
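An R-CNN-style detector built on CLIP can be sketched as: score each region proposal against the text embedding, drop low scores, then apply non-maximum suppression. This is a minimal sketch under stated assumptions: `embed_region` is a hypothetical stand-in for cropping the box and running CLIP's image encoder, and the boxes, vectors, and thresholds are invented.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    width = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    height = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = width * height
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def detect(proposals, text_emb, embed_region, score_threshold=0.5, iou_threshold=0.5):
    """Keep proposals that match the text, suppressing overlapping duplicates."""
    text = normalize(text_emb)
    scored = []
    for box in proposals:
        score = dot(normalize(embed_region(box)), text)
        if score >= score_threshold:
            scored.append((score, box))
    scored.sort(reverse=True)  # highest score first
    kept = []
    for score, box in scored:
        if all(iou(box, other) < iou_threshold for _, other in kept):
            kept.append((score, box))
    return kept


# Hypothetical region embeddings, keyed by box (stand-in for a real image encoder).
region_embeddings = {
    (0, 0, 10, 10): [0.9, 0.1],
    (1, 1, 11, 11): [0.8, 0.2],
    (50, 50, 60, 60): [0.1, 0.9],
}
detections = detect(list(region_embeddings), [1.0, 0.0], region_embeddings.get)
print(detections)
```

The cost is one image-encoder pass per proposal, which is the main reason this approach is slow without fine-tuning or distillation.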
Region proposals alternatives
Salient Object Detection Techniques in Computer Vision—A Survey
Salient object detection (SOD) is an important computer vision task aimed at precise
detection and segmentation of visually distinctive image regions from the perspective of the
human visual system
Region proposals alternatives
Open-World Entity Segmentation
Entity Segmentation is a segmentation task with the aim to segment everything in an image
into semantically-meaningful regions without considering any category labels.
What is knowledge distillation?
Knowledge Distillation: Simplified
Knowledge distillation refers to the idea of model compression by teaching a smaller network,
step by step, exactly what to do using a bigger already trained network.
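The classic soft-target form of this idea can be sketched as a temperature-scaled KL divergence between teacher and student predictions. This illustrates the general technique only; it is an assumption that this form is the relevant one here, and the logits and temperature are made-up values.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Log-probabilities of temperature-scaled logits, computed stably."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_sum_exp for x in scaled]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    log_p = log_softmax(teacher_logits, temperature)
    log_q = log_softmax(student_logits, temperature)
    kl = sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))
    return temperature * temperature * kl


# Hypothetical logits for a 3-class problem.
teacher = [4.0, 1.0, 0.5]
good_student = [3.8, 1.1, 0.4]
bad_student = [0.5, 4.0, 1.0]

loss_good = distillation_loss(good_student, teacher)
loss_bad = distillation_loss(bad_student, teacher)
print(loss_good < loss_bad)
```

A higher temperature flattens the teacher's distribution, so the student also learns from the relative scores of the wrong classes.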
Mask R-CNN
- Why?
- Class-agnostic bbox regression and mask prediction
Mask R-CNN
Vision and Language Knowledge Distillation (ViLD)
Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
ViLD results
Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
ViLD generalization ability
Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
ViLD visualizations
Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
ViLD visualizations
Open-vocabulary Object Detection via Vision and Language Knowledge Distillation
Zero-Shot object tracking
Introducing Zero Shot Object Tracking
VQGAN + CLIP
The Illustrated VQGAN
VQGAN + CLIP
https://github.com/nerdyrodent/VQGAN-CLIP
"A painting of an apple in a fruit bowl | psychedelic | surreal:0.5 | weird:0.25"
"A painting of an apple in a fruit bowl"
StyleCLIP
StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
StyleGAN3
Alias-Free Generative Adversarial Networks (StyleGAN3)
StyleGAN3 + CLIP
StyleGAN3 + CLIP by mishin_learning
Thank you for your attention!
Yurii Pashchenko AI&BigData Online Day 2021
Yurii Pashchenko
Sr ML Engineer at Depositphotos
yurii_pas
george.pashchenko@gmail.com

Yurii Pashchenko: Zero-shot learning capabilities of CLIP model from OpenAI
