Yurii Pashchenko: Zero-shot learning capabilities of CLIP model from OpenAI
The document discusses the zero-shot learning capabilities of the CLIP model from OpenAI, including applications in zero-shot classification, image ranking, and search. It highlights the model's training process, limitations, and performance metrics, particularly in context with datasets like FairFace. Additionally, it touches on concepts like knowledge distillation, generative adversarial networks, and alternatives for object detection and segmentation.
1.
Zero-shot learning capabilities
of CLIP model from OpenAI
Yurii Pashchenko AI&BigData Online Day 2021
Yurii Pashchenko
Sr ML Engineer at Depositphotos
2.
About me
❏ Yurii Pashchenko
❏ Sr Machine Learning Engineer at Depositphotos
❏ Over 8 years of research and commercial experience in
applying Deep Learning models
❏ Object Detection Specialist
❏ Knowledge Sharing Master at Transformer* (*at least I want to become one)
3.
Zero-shot learning capabilities of CLIP model
from OpenAI
❏ Short intro to Zero-Shot Learning and CLIP from OpenAI
❏ Zero-Shot Classification based on CLIP
❏ CLIP for image ranking & search
❏ Limitations of CLIP model
❏ Object Detection/Segmentation
❏ Knowledge distillation
❏ GANs + CLIP
4.
What is Zero-Shot Learning
Understanding Zero-Shot Learning — Making ML More Human
5.
Motivation for CLIP from OpenAI
● Costly datasets
● Narrow (a model handles only the task it was trained for)
● Poor real-world performance
CLIP: Connecting Text and Images
6.
CLIP: Contrastive Language-Image
Pre-training
Learning Transferable Visual Models From Natural Language Supervision
● 400 million (image, text) pairs collected from the Internet.
● Trained modifications of ResNet-50 and ViT-B
● Batch size 32,768 for 32 epochs
● The largest ResNet model, RN50x64, took 18 days to train on 592 V100 GPUs, while the largest Vision Transformer took 12 days on 256 V100 GPUs
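The contrastive objective behind the pre-training above can be sketched in a few lines. This is a minimal, pure-Python illustration, not the training code: the toy embeddings, the fixed `temperature=0.07` (the paper learns this value), and all function names are assumptions made for the example. Each image should score highest against its own text, and each text against its own image.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    # Numerically stable -log(softmax(logits)[target]).
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    # Pair i of the batch is (image_embs[i], text_embs[i]).
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)
    # logits[i][j]: scaled cosine similarity between image i and text j.
    logits = [[sum(a * b for a, b in zip(img, txt)) / temperature
               for txt in texts] for img in images]
    # Image-to-text direction: row i should pick column i.
    loss_images = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    # Text-to-image direction: column j should pick row j.
    loss_texts = sum(cross_entropy([logits[i][j] for i in range(n)], j)
                     for j in range(n)) / n
    return (loss_images + loss_texts) / 2

# Toy batch of two pairs (hypothetical 2-d embeddings).
matched = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
loss_matched = clip_contrastive_loss(matched, matched)
loss_mismatched = clip_contrastive_loss(matched, swapped)
```

With correctly paired embeddings the loss is close to zero; swapping the texts makes it large, which is the signal the encoders are trained on.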
8.
CLIP for Zero-Shot Classification
Learning Transferable Visual Models From Natural Language Supervision
Prompt engineering combined with ensembling around 80 prompts improves ImageNet accuracy by almost 5%
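A rough sketch of zero-shot classification with prompt ensembling: each class name is inserted into several prompt templates, the text embeddings are averaged into one class embedding, and the image is assigned to the class with the highest cosine similarity. The encoder here is a toy word-count stand-in, not CLIP, and `VOCAB`, `TEMPLATES`, `toy_encode_text` and the image embedding are all hypothetical.

```python
import math

VOCAB = ["dog", "cat", "photo", "sketch"]  # hypothetical toy vocabulary

def toy_encode_text(text):
    # Stand-in for a real text encoder: counts vocabulary words in the prompt.
    words = text.lower().replace(".", "").split()
    return [float(words.count(w)) for w in VOCAB]

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def class_embedding(label, templates, encode_text):
    # Ensemble in embedding space: average the normalized prompt embeddings,
    # then normalize the mean again.
    embs = [normalize(encode_text(t.format(label))) for t in templates]
    mean = [sum(col) / len(embs) for col in zip(*embs)]
    return normalize(mean)

def zero_shot_classify(image_emb, labels, templates, encode_text):
    image = normalize(image_emb)
    scores = {}
    for label in labels:
        text = class_embedding(label, templates, encode_text)
        scores[label] = sum(a * b for a, b in zip(image, text))
    best = max(scores, key=scores.get)
    return best, scores

TEMPLATES = ["a photo of a {}.", "a sketch of a {}."]
image_emb = [0.9, 0.1, 0.4, 0.0]  # pretend output of an image encoder
prediction, scores = zero_shot_classify(
    image_emb, ["dog", "cat"], TEMPLATES, toy_encode_text)
```

No classifier is trained at any point; adding a new class only requires adding its name to the label list.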
CLIP Zero-Shot vs Few-Shot
Learning Transferable Visual Models From Natural Language Supervision
12.
CLIP on FairFace
FairFace: Face Attribute Dataset for Balanced Race, Gender, and Age for Bias Measurement and Mitigation
CLIP has a top-1 accuracy of 59.2% for “in the wild” celebrity image classification when choosing from 100 candidates and a top-1 accuracy of 43.3% when choosing from 1000 possible choices.
14.
CLIP for Image Ranking
DALL·E: Creating Images from Text
“an armchair in the shape of an avocado”
“a living room with two white armchairs and a painting of the colosseum. the painting is mounted above a modern fireplace”
15.
CLIP for Image Search
Text-to-Image
Unsplash Image Search
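Text-to-image search with a shared embedding space reduces to nearest-neighbour ranking: embed the query text, then sort pre-computed image embeddings by cosine similarity. The sketch below assumes the embeddings already exist; the file names, the 3-d vectors and the `search` helper are made up for illustration.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_emb, index, top_k=3):
    # index maps an image name to its pre-computed embedding.
    ranked = sorted(index.items(),
                    key=lambda item: cosine(query_emb, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]

# Hypothetical image embeddings, computed once and stored.
index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}
query = [0.8, 0.2, 0.1]  # pretend embedding of the query text
results = search(query, index, top_k=2)
```

Image-to-image search works the same way, with the query embedding coming from the image encoder instead of the text encoder.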
16.
CLIP for Image Search
Image-to-Image
Unsplash Image Search
17.
CLIP for Image Search
Text+Text-to-Image
Unsplash Image Search
18.
CLIP for Image Search
Image+Text-to-Image
Unsplash Image Search
+
“cars”
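One plausible way to build a combined query such as image + “cars” is to add the normalized embeddings (optionally weighted) and normalize the result, then search with the mixed vector. This is an assumption about how such queries can be formed, not a description of the Unsplash demo's code; `combine_queries` and the vectors are hypothetical.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def combine_queries(embeddings, weights=None):
    # Weighted sum of unit-length query embeddings, normalized again.
    if weights is None:
        weights = [1.0] * len(embeddings)
    unit = [normalize(e) for e in embeddings]
    dim = len(unit[0])
    mixed = [sum(w * e[i] for w, e in zip(weights, unit)) for i in range(dim)]
    return normalize(mixed)

image_query = [1.0, 0.0, 0.0]  # pretend embedding of the query image
text_query = [0.0, 1.0, 0.0]   # pretend embedding of the text "cars"
combined = combine_queries([image_query, text_query], weights=[1.0, 1.0])
```

Raising the weight of the text pulls the results toward the text concept; lowering it keeps them closer to the query image.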
20.
CLIP limitations
Learning Transferable Visual Models From Natural Language Supervision
● poor generalization to images not covered in its pre-training dataset (MNIST)
21.
CLIP limitations
Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers
Text: an elephant, a zebra, a lake
(examples from this Colab)
22.
CLIP limitations
Learning Transferable Visual Models From Natural Language Supervision
● poor generalization to images not covered in its pre-training dataset (MNIST)
● counting the number of objects in an image
● predicting how close the nearest object is in a photo
● CLIP’s zero-shot classifiers can be sensitive to wording or phrasing and sometimes require trial and error “prompt engineering” to perform well.
24.
You can’t just make an Object Detector from a Classifier
… without fine-tuning
25.
Assembling Object Detector with CLIP
Rich feature hierarchies for accurate object detection and semantic segmentation
[Diagram: CLIP Text Encoder with the text input “person”]
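An R-CNN-style detector built from a classifier can be sketched as: generate region proposals, score each cropped region against the text query, drop low scores, and apply non-maximum suppression. In the sketch below the scorer is a toy stand-in for "CLIP similarity between the crop and the text “person”"; `toy_score_region`, `TARGET`, the thresholds and the boxes are all hypothetical.

```python
def iou(a, b):
    # Boxes are (x1, y1, x2, y2); returns intersection over union.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def detect(proposals, score_region, score_threshold=0.5, iou_threshold=0.5):
    # 1. Score every proposal with the classifier.
    scored = [(score_region(box), box) for box in proposals]
    # 2. Keep confident regions, best first.
    scored = [sb for sb in scored if sb[0] >= score_threshold]
    scored.sort(key=lambda sb: sb[0], reverse=True)
    # 3. Greedy non-maximum suppression.
    kept = []
    for score, box in scored:
        if all(iou(box, kept_box) < iou_threshold for _, kept_box in kept):
            kept.append((score, box))
    return kept

# Toy scorer: pretends a person occupies TARGET and scores a region by
# its overlap with it. A real pipeline would crop the region and compare
# its image embedding with the text embedding.
TARGET = (10, 10, 50, 90)

def toy_score_region(box):
    return iou(box, TARGET)

proposals = [(10, 10, 50, 90), (12, 12, 52, 92), (60, 60, 100, 100)]
detections = detect(proposals, toy_score_region)
```

The quality of such a detector depends heavily on the proposals, which is why alternatives for generating them matter.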
26.
Region proposals alternatives
Salient Object Detection Techniques in Computer Vision—A Survey
Salient object detection (SOD) is an important computer vision task aimed at precise detection and segmentation of visually distinctive image regions from the perspective of the human visual system.
27.
Region proposals alternatives
Open-World Entity Segmentation
Entity Segmentation is a segmentation task with the aim to segment everything in an image into semantically-meaningful regions without considering any category labels.
29.
What is knowledge distillation?
Knowledge Distillation : Simplified
Knowledge distillation refers to the idea of model compression by teaching a smaller network,
step by step, exactly what to do using a bigger already trained network.
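A common form of the distillation objective compares the teacher's and the student's softened output distributions. The sketch below uses temperature-scaled softmax and KL divergence, scaled by the squared temperature; the logits, `temperature=2.0` and the function names are illustrative assumptions, and the slides do not say which variant the speaker used.

```python
import math

def softmax(logits, temperature=1.0):
    # Higher temperature gives a softer (flatter) distribution.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # KL(teacher || student) on softened distributions, times T^2 so the
    # gradient scale stays comparable across temperatures.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

teacher = [4.0, 1.0, 0.5]  # hypothetical logits from the large model
loss_same = distillation_loss(teacher, teacher)
loss_close = distillation_loss([3.8, 1.1, 0.4], teacher)
loss_far = distillation_loss([0.5, 1.0, 4.0], teacher)
```

The loss is zero when the student reproduces the teacher exactly and grows as the two distributions diverge.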
Thank you for your attention!
Yurii Pashchenko AI&BigData Online Day 2021
Yurii Pashchenko
Sr ML Engineer at Depositphotos
yurii_pas
george.pashchenko@gmail.com