2022
DOI: 10.48550/arxiv.2201.05729
Preprint

CLIP-TD: CLIP Targeted Distillation for Vision-Language Tasks

Abstract: Contrastive language-image pretraining (CLIP) links vision and language modalities into a unified embedding space, yielding tremendous potential for vision-language (VL) tasks. While early concurrent works have begun to study this potential on a subset of tasks, important questions remain: 1) What is the benefit of CLIP on unstudied VL tasks? 2) Does CLIP provide benefit in low-shot or domain shifted scenarios? 3) Can CLIP improve existing approaches without impacting inference or pretraining complexity? In th…
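To make the "unified embedding space" concrete, the sketch below probes it with the open-source openai/CLIP package: image and text features are projected into the same space and compared by cosine similarity. This is an illustrative sketch only (the image path and prompt strings are placeholders), not the CLIP-TD method described in the paper.

```python
import torch
import clip  # openai/CLIP package: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # pretrained image + text encoders

# Placeholder inputs: any RGB image and a few candidate captions.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
texts = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)

with torch.no_grad():
    img_feat = model.encode_image(image)  # (1, d) image embedding
    txt_feat = model.encode_text(texts)   # (2, d) text embeddings

# Normalize so the dot product equals cosine similarity in the shared space.
img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
similarity = (img_feat @ txt_feat.T).softmax(dim=-1)  # relative image-text match scores
print(similarity)
```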

Cited by 8 publications (7 citation statements). References 41 publications (68 reference statements).
“…Many algorithms have been developed lately [34,31,36,24,26,28] that attempt knowledge distillation from the CLIP model to benefit downstream tasks in one way or another by leveraging the rich semantic language information paired with the images. Here, we directly adopt the backbone of the CLIP image model to train for open-vocabulary panoptic/semantic segmentation.…”
Section: Related Work
confidence: 99%
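The distillation these works build on can be summarized as training a task model to match features produced by a frozen CLIP teacher. The sketch below shows a generic feature-distillation loss, assuming hypothetical student and teacher encoders with matching output dimension; it is not the specific objective used in CLIP-TD or in the cited methods.

```python
import torch
import torch.nn.functional as F

def clip_feature_distillation_loss(student_feats: torch.Tensor,
                                   teacher_feats: torch.Tensor) -> torch.Tensor:
    """Generic distillation objective: pull student features toward the
    frozen CLIP teacher's features in the shared embedding space.
    Both tensors are assumed to have shape (batch, dim)."""
    s = F.normalize(student_feats, dim=-1)
    t = F.normalize(teacher_feats.detach(), dim=-1)  # teacher stays frozen
    return (1.0 - (s * t).sum(dim=-1)).mean()        # mean (1 - cosine similarity)

# Toy usage with random features standing in for real encoder outputs.
student_feats = torch.randn(8, 512, requires_grad=True)
teacher_feats = torch.randn(8, 512)
loss = clip_feature_distillation_loss(student_feats, teacher_feats)
loss.backward()
```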
“…It contains an image encoder and a text encoder to measure the content similarity of a given image and text. Many recent works have transferred it to multiple downstream tasks, including semantic segmentation [8,24], object detection [5], Visual Question Answering [48], and image generation [33]. Many researchers regard CLIP as a pre-trained feature extractor [8,24,26,38,52].…”
Section: Vision-Language Contrastive Learning
confidence: 99%
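Using CLIP as a pre-trained feature extractor, as this quote describes, typically means freezing the image encoder and training a lightweight head on top. A minimal linear-probe sketch follows; the encoder handle, feature width, and class count are placeholders, not details from the paper.

```python
import torch
import torch.nn as nn

# Assumed setup: `clip_image_encoder` is a frozen CLIP image encoder mapping a
# batch of images to (batch, 512) features; `num_classes` is task-specific.
num_classes = 10
linear_probe = nn.Linear(512, num_classes)
optimizer = torch.optim.AdamW(linear_probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor, clip_image_encoder) -> float:
    with torch.no_grad():                  # keep the CLIP backbone frozen
        feats = clip_image_encoder(images) # (batch, 512) frozen features
    logits = linear_probe(feats.float())
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```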
“…With UM and DS biases appearing to be dominant in text information, existing VL models are encouraged to learn shortcuts in text and under-utilize the visual information, leading to false visual dependency, as in Figure 6a (Cao et al., 2020; Wang et al., 2022c; Dancette et al., 2021; Chen et al., 2020a). To assist models in mitigating UM and DS biases, we design ADS-I to synthesize positive images I+ and negative images I− to assist models' training and emphasize the correct question-related visual dependency.…”
Section: ADS-I
confidence: 99%