2021
DOI: 10.48550/arxiv.2112.09445
Preprint

Data Efficient Language-supervised Zero-shot Recognition with Optimal Transport Distillation

Abstract: Traditional computer vision models are trained to predict a fixed set of predefined categories. Recently, natural language has been shown to be a broader and richer source of supervision that provides finer descriptions of visual concepts than supervised "gold" labels. Previous works, such as CLIP, use the InfoNCE loss to train a model to predict the pairing between images and text captions. CLIP, however, is data hungry and requires more than 400M image-text pairs for training. The inefficiency can be partially a…
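
For context on the pairing objective named in the abstract, here is a minimal PyTorch-style sketch of a CLIP-style InfoNCE loss, assuming batched image and text embeddings from two encoders; the function name, tensor shapes, and temperature value are illustrative, not taken from the paper.

```python
# Minimal sketch (not the authors' code) of a CLIP-style InfoNCE objective:
# every image is pulled toward its paired caption and pushed away from the
# other captions in the batch, and symmetrically for captions.
import torch
import torch.nn.functional as F

def clip_info_nce(image_emb: torch.Tensor, text_emb: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """image_emb, text_emb: (batch, dim) outputs of the two encoders."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature                # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)   # hard one-to-one pairing
    loss_i2t = F.cross_entropy(logits, targets)                    # image -> matching caption
    loss_t2i = F.cross_entropy(logits.t(), targets)                # caption -> matching image
    return 0.5 * (loss_i2t + loss_t2i)
```

With noisy web-collected pairs, the hard one-to-one targets above may not reflect the true correspondences, which is the gap the optimal-transport distillation in the paper's title is aimed at: replacing them with a softer matching.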

Cited by 4 publications (4 citation statements) · References 35 publications

“…We follow zero-shot CLIP benchmark 7 implementation for most of the datasets, and implement the ones that are missing. For most image classification tasks we compute Accuracy@1, except HatefulMemes where we compute AUROC because it is binary classification, OpenImages where we compute FlatHit@1 following [75], and PascalVOC2007 where we compute mean average precision (mAP) because it is multi-label classification. We use the same prompt ensembling method as CLIP [61] to improve zero-shot image classification.…”
Section: A3 Evaluation Details
confidence: 99%
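
The prompt ensembling mentioned in the excerpt above can be sketched as follows. This is an illustration rather than the benchmark's code: `encode_text` is a stand-in for the text encoder, and the templates are placeholders, not CLIP's actual prompt set. Per class, the text embeddings of several templated prompts are averaged and re-normalized, and images are then classified by cosine similarity.

```python
# Hedged sketch of CLIP-style prompt ensembling for zero-shot classification.
# `encode_text` is assumed to map a list of strings to a (num_prompts, dim) tensor.
import torch
import torch.nn.functional as F

TEMPLATES = ["a photo of a {}.", "a blurry photo of a {}.", "a drawing of a {}."]  # placeholders

def build_zero_shot_classifier(encode_text, class_names):
    weights = []
    for name in class_names:
        prompts = [t.format(name) for t in TEMPLATES]
        emb = F.normalize(encode_text(prompts), dim=-1)   # one embedding per prompt
        emb = F.normalize(emb.mean(dim=0), dim=-1)        # average, then re-normalize
        weights.append(emb)
    return torch.stack(weights)                           # (num_classes, dim)

def zero_shot_predict(image_emb, classifier_weights):
    image_emb = F.normalize(image_emb, dim=-1)
    return (image_emb @ classifier_weights.t()).argmax(dim=-1)   # Accuracy@1 prediction
```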
“…The core technique of CLIP is aligning the vision and language modalities in a joint embedding space through contrastive learning on global representations. Follow-up works further improve CLIP from the vision-only [43,67] or vision-language [38,64,68,69] side. In this paper, we bring natural language supervision together with masked image modeling for better visual pre-training under these two paradigms.…”
Section: Related Work
confidence: 99%
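
As a rough illustration of the masked image modeling mentioned in the excerpt above (not the cited paper's implementation), the core step is hiding a random subset of image patches from the encoder and training the model to reconstruct them; the patch size and mask ratio below are illustrative defaults.

```python
# Hedged sketch of the patch-masking step behind masked image modeling.
import torch

def random_patch_mask(images: torch.Tensor, patch_size: int = 16, mask_ratio: float = 0.75):
    """images: (batch, channels, H, W). Returns a (batch, num_patches) bool mask,
    True where a patch is hidden from the encoder."""
    b, _, h, w = images.shape
    num_patches = (h // patch_size) * (w // patch_size)
    num_masked = int(mask_ratio * num_patches)
    ids_shuffle = torch.rand(b, num_patches, device=images.device).argsort(dim=1)  # random patch order
    mask = torch.zeros(b, num_patches, dtype=torch.bool, device=images.device)
    rows = torch.arange(b, device=images.device).unsqueeze(1)
    mask[rows, ids_shuffle[:, :num_masked]] = True      # flag the first `num_masked` patches per image
    return mask
```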
“…OT has been applied to many areas, such as domain adaptation [11], generative models [6,33], and self-supervised learning [24,40]. In cross-lingual settings, Nguyen and Luu [29] employed the OT distance as part of the loss function in a knowledge distillation framework to improve cross-lingual summarization.…”
Section: Optimal Transport
confidence: 99%
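
Since both the paper's title and this excerpt revolve around optimal transport, the following is a minimal sketch of entropic OT computed with Sinkhorn iterations; the uniform marginals, epsilon, and iteration count are illustrative assumptions, not the authors' settings.

```python
# Hedged sketch of entropic optimal transport via Sinkhorn iterations.
import torch

def sinkhorn_plan(cost: torch.Tensor, epsilon: float = 0.05, n_iters: int = 50) -> torch.Tensor:
    """cost: (n, m) pairwise cost matrix. Returns an (n, m) soft transport plan whose
    rows and columns approximately sum to uniform marginals."""
    n, m = cost.shape
    K = torch.exp(-cost / epsilon)                        # Gibbs kernel
    a = torch.full((n,), 1.0 / n, device=cost.device)     # uniform row marginal
    b = torch.full((m,), 1.0 / m, device=cost.device)     # uniform column marginal
    v = torch.ones(m, device=cost.device)
    for _ in range(n_iters):                              # alternating marginal projections
        u = a / (K @ v)
        v = b / (K.t() @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)            # diag(u) K diag(v)
```

The resulting plan can be read as a soft matching between the two sets, for example as soft image-text targets in place of the one-hot pairings used by the InfoNCE loss sketched earlier.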