Yurii Pashchenko: Zero-shot learning capabilities of CLIP model from OpenAI

Zero-shot learning capabilities of CLIP model from OpenAI
Yurii Pashchenko AI&BigData Online Day 2021
Yurii Pashchenko
Sr ML Engineer at Depositphotos

About me
❏ Yurii Pashchenko
❏ Sr Machine Learning Engineer at Depositphotos
❏ Over 8 years of research and commercial experience in applying Deep Learning models
❏ Object Detection Specialist
❏ Knowledge Sharing Master at Transformer* (* at least I want to become one)

Zero-shot learning capabilities of CLIP model from OpenAI
❏ Short intro to Zero-Shot Learning and CLIP from OpenAI
❏ Zero-Shot Classification based on CLIP
❏ CLIP for image ranking & search
❏ Limitations of CLIP model
❏ Object Detection/Segmentation
❏ Knowledge distillation
❏ GANs + CLIP

What is Zero-Shot Learning
Understanding Zero-Shot Learning — Making ML More Human

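Motivation for CLIP from OpenAI
● Costly datasets: labeled vision datasets are labor-intensive to build
● Narrow: a model predicts only the fixed set of categories it was trained on
● Poor real-world performance: benchmark results often do not transfer to real-world data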
CLIP: Connecting Text and Images

CLIP: Contrastive Language-Image Pre-training
Learning Transferable Visual Models From Natural Language Supervision
● 400 million (image, text) pairs collected from the Internet
● Trained modifications of ResNet-50 and ViT-B
● Batch size of 32,768 for 32 epochs
● The largest ResNet model, RN50x64, took 18 days to train on 592 V100 GPUs, while the largest Vision Transformer took 12 days on 256 V100 GPUs
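
The contrastive pre-training objective named in the slide title can be sketched roughly as follows. This is a minimal pure-Python illustration of a symmetric cross-entropy loss over an image-text similarity matrix. The function names and the toy embeddings are assumptions made for illustration, and the temperature is fixed here although CLIP learns it during training. It is not OpenAI's implementation.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Negative log-softmax of the target entry (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]

def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss for a batch of matching (image, text) pairs.

    Row k of image_embs is assumed to match row k of text_embs.
    The temperature is a fixed assumption here.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image -> text direction: each row should pick its own column.
    loss_images = sum(cross_entropy(logits[k], k) for k in range(n)) / n
    # Text -> image direction: each column should pick its own row.
    loss_texts = sum(
        cross_entropy([logits[r][k] for r in range(n)], k) for k in range(n)
    ) / n
    return (loss_images + loss_texts) / 2

# Toy batch: correctly paired embeddings versus the same embeddings mispaired.
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the correctly paired embeddings give a loss close to zero, while the mispaired ones give a much larger loss.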

CLIP for Zero-Shot Classification
Prompt engineering combined with ensembling of around 80 prompts improves ImageNet accuracy by almost 5 percentage points
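
A minimal sketch of zero-shot classification with prompt ensembling: each class name is inserted into several prompt templates, the resulting text embeddings are averaged into one class embedding, and the image is assigned to the class with the highest cosine similarity. The `toy_encode_text` function is a hypothetical stand-in for a real CLIP text encoder, and the templates and vectors are illustrative assumptions.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def class_embedding(class_name, templates, encode_text):
    """Average the normalized embeddings of all prompts for one class."""
    embs = [l2_normalize(encode_text(t.format(class_name))) for t in templates]
    dim = len(embs[0])
    mean = [sum(e[d] for e in embs) / len(embs) for d in range(dim)]
    return l2_normalize(mean)

def zero_shot_classify(image_emb, class_names, templates, encode_text):
    """Return the best class and all cosine-similarity scores."""
    img = l2_normalize(image_emb)
    scores = {
        name: dot(img, class_embedding(name, templates, encode_text))
        for name in class_names
    }
    return max(scores, key=scores.get), scores

def toy_encode_text(prompt):
    # Hypothetical stand-in for a CLIP text encoder: a 2-D vector
    # driven only by which keyword appears in the prompt.
    return [1.0 if "dog" in prompt else 0.1, 1.0 if "cat" in prompt else 0.1]

templates = ["a photo of a {}.", "a blurry photo of a {}."]
image_emb = [0.9, 0.2]  # assumed output of an image encoder for a dog photo
label, scores = zero_shot_classify(image_emb, ["dog", "cat"], templates, toy_encode_text)
```

Averaging in embedding space means the ensemble costs the same at inference time as a single prompt, because the class embeddings can be computed once and cached.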

CLIP Zero-Shot visual results
CLIP: Connecting Text and Images

CLIP Zero-Shot generalization

CLIP Zero-Shot vs Few-Shot

CLIP on FairFace
FairFace: Face Attribute Dataset for Balanced Race, Gender, and Age for Bias Measurement and
Mitigation
CLIP has a top-1 accuracy of 59.2% for “in the wild” celebrity image classification when choosing from 100 candidates and a top-1 accuracy of 43.3% when choosing from 1000 possible choices

CLIP for Image Ranking
DALL·E: Creating Images from Text
“an armchair in the shape of an avocado”
“a living room with two white armchairs and a painting of the colosseum. the painting is mounted above a modern fireplace”

CLIP for Image Search
Text-to-Image
Unsplash Image Search

Image-to-Image

Text+Text-to-Image

Image+Text-to-Image
[query image] + “cars”
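
The search modes above can all be sketched as nearest-neighbour ranking in the shared embedding space. The example below is a rough illustration in which a combined query is formed by summing normalized embeddings. That combination rule, the function names, and the 3-D toy vectors are assumptions for illustration. Real embeddings would come from the CLIP image and text encoders.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def combine_queries(embeddings, weights=None):
    """Weighted sum of normalized query embeddings, renormalized."""
    weights = weights or [1.0] * len(embeddings)
    normed = [l2_normalize(e) for e in embeddings]
    dim = len(normed[0])
    mixed = [sum(w * e[d] for w, e in zip(weights, normed)) for d in range(dim)]
    return l2_normalize(mixed)

def rank_images(query_emb, gallery, top_k=3):
    """Return gallery keys sorted by cosine similarity to the query."""
    q = l2_normalize(query_emb)
    scores = {
        name: sum(a * b for a, b in zip(q, l2_normalize(emb)))
        for name, emb in gallery.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy gallery; the axes loosely stand for (street, car, forest).
gallery = {
    "empty_street": [1.0, 0.0, 0.0],
    "street_with_cars": [1.0, 1.0, 0.0],
    "forest": [0.0, 0.0, 1.0],
}
query_image_emb = [1.0, 0.0, 0.1]  # assumed embedding of a street photo
cars_text_emb = [0.0, 1.0, 0.0]    # assumed embedding of the text "cars"

image_only = rank_images(query_image_emb, gallery)
image_plus_text = rank_images(combine_queries([query_image_emb, cars_text_emb]), gallery)
```

With the image alone the empty street ranks first. Adding the text query moves the street with cars to the top.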

CLIP limitations
● poor generalization to images not covered in its pre-training dataset (e.g. handwritten digits in MNIST)

CLIP limitations
Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers
Text: “an elephant”, “a zebra”, “a lake”
(examples from this Colab)

CLIP limitations
● poor generalization to images not covered in its pre-training dataset (e.g. handwritten digits in MNIST)
● counting the number of objects in an image
● predicting how close the nearest object is in a photo
● CLIP’s zero-shot classifiers can be sensitive to wording or phrasing and sometimes require trial and error “prompt engineering” to perform well

You can’t just make an Object Detector from a Classifier
… without fine-tuning

Assembling Object Detector with CLIP
Rich feature hierarchies for accurate object detection and semantic segmentation
[Diagram labels: CLIP Text Encoder, input text “person”]
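
An R-CNN-style detector assembled around CLIP can be sketched as: take region proposals, embed each crop, score it against the text embedding of the label, then keep the best non-overlapping boxes. The sketch below assumes a hypothetical `encode_crop` callable in place of the CLIP image encoder, and the thresholds are illustrative values rather than tuned settings.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def detect(proposals, encode_crop, text_emb, score_thr=0.5, iou_thr=0.5):
    """Score each proposal against the text, then apply greedy NMS."""
    scored = [(cosine(encode_crop(box), text_emb), box) for box in proposals]
    scored = [sb for sb in scored if sb[0] >= score_thr]
    scored.sort(key=lambda sb: sb[0], reverse=True)
    kept = []
    for score, box in scored:
        # Keep a box only if it does not overlap a higher-scoring kept box.
        if all(iou(box, kept_box) < iou_thr for _, kept_box in kept):
            kept.append((score, box))
    return kept

# Toy setup: precomputed crop embeddings stand in for the image encoder.
crop_embeddings = {
    (0, 0, 10, 10): [1.0, 0.0],    # tight box around a person
    (1, 1, 11, 11): [0.9, 0.1],    # near-duplicate proposal
    (50, 50, 60, 60): [0.0, 1.0],  # background region
}
person_text_emb = [1.0, 0.0]  # assumed text embedding for "person"
detections = detect(list(crop_embeddings), crop_embeddings.get, person_text_emb)
boxes = [box for _, box in detections]
```

Running the encoder once per proposal is what makes this approach slow, which is part of the motivation for the distillation-based detector discussed later in the talk.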

Region proposals alternatives
Salient Object Detection Techniques in Computer Vision—A Survey
Salient object detection (SOD) is an important computer vision task aimed at precise
detection and segmentation of visually distinctive image regions from the perspective of the
human visual system

Region proposals alternatives
Open-World Entity Segmentation
Entity Segmentation is a segmentation task with the aim to segment everything in an image
into semantically-meaningful regions without considering any category labels.

What is knowledge distillation?
Knowledge Distillation: Simplified
Knowledge distillation refers to the idea of model compression by teaching a smaller network,
step by step, exactly what to do using a bigger already trained network.
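
One common way to express this idea is a soft-target loss: the student is trained to match the teacher's temperature-softened output distribution. The sketch below shows that classic formulation in pure Python. The function names and the temperature value are assumptions, and the formulation is not necessarily the one used by the detectors later in this talk.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax (numerically stable)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return kl * temperature ** 2

teacher = [4.0, 1.0, 0.5]
loss_same = distillation_loss(teacher, teacher)           # student matches teacher
loss_close = distillation_loss([3.5, 1.2, 0.4], teacher)  # student nearly matches
loss_far = distillation_loss([0.5, 1.0, 4.0], teacher)    # student disagrees
```

A higher temperature flattens both distributions, so the student also learns from the relative scores the teacher gives to the wrong classes.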

Mask R-CNN
- Why?
- Class-agnostic bbox regression and mask prediction
Mask R-CNN

Vision and Language Knowledge Distillation (ViLD)
Open-vocabulary Object Detection via Vision and Language Knowledge Distillation

ViLD results

ViLD generalization ability

ViLD visualizations

Zero-Shot object tracking
Introducing Zero Shot Object Tracking

VQGAN + CLIP
The Illustrated VQGAN

VQGAN + CLIP
https://github.com/nerdyrodent/VQGAN-CLIP
"A painting of an apple in a fruit bowl | psychedelic | surreal:0.5 | weird:0.25"
"A painting of an apple in a fruit bowl"
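
The first prompt above combines several text prompts separated by `|`, some with a `:weight` suffix. A minimal sketch of how such a string might be split into (text, weight) pairs, assuming a default weight of 1.0 when none is given. The function name is hypothetical and this is not the repository's actual parser.

```python
def parse_weighted_prompts(prompt_string):
    """Split 'a | b:0.5' into [('a', 1.0), ('b', 0.5)]."""
    parsed = []
    for part in prompt_string.split("|"):
        part = part.strip()
        if not part:
            continue
        text, sep, weight = part.rpartition(":")
        if sep:
            try:
                parsed.append((text.strip(), float(weight)))
                continue
            except ValueError:
                pass  # the colon belongs to the prompt text itself
        parsed.append((part, 1.0))  # assumed default weight
    return parsed

prompts = parse_weighted_prompts(
    "A painting of an apple in a fruit bowl | psychedelic | surreal:0.5 | weird:0.25"
)
```

Each pair would then presumably contribute its own CLIP similarity term to the optimization target, scaled by its weight.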

StyleCLIP
StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery

StyleGAN3
Alias-Free Generative Adversarial Networks (StyleGAN3)

StyleGAN3 + CLIP
StyleGAN3 + CLIP by mishin_learning

Thank you for your attention!
Yurii Pashchenko AI&BigData Online Day 2021
Yurii Pashchenko
Sr ML Engineer at Depositphotos
yurii_pas
george.pashchenko@gmail.com
