2021
DOI: 10.48550/arxiv.2109.01134
Preprint

Learning to Prompt for Vision-Language Models

Kaiyang Zhou,
Jingkang Yang,
Chen Change Loy
et al.

Abstract: Vision-language pre-training has recently emerged as a promising alternative for representation learning. It shifts from the tradition of using images and discrete labels to learn a fixed set of weights, seen as visual concepts, to aligning images and raw text with two separate encoders. Such a paradigm benefits from a broader source of supervision and allows zero-shot transfer to downstream tasks, since visual concepts can be generated directly from natural language via prompts. In this paper, we …
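The abstract describes the two-encoder paradigm in which class names wrapped in natural-language prompts are matched against image features for zero-shot transfer. Below is a minimal sketch of that mechanism; the encoders and class names are toy stand-ins for illustration, not the actual pre-trained CLIP model.

```python
# Minimal sketch of prompt-based zero-shot classification in the CLIP style.
# The encoders here are stand-in modules, not pre-trained weights; class names
# and prompt tokens are hypothetical examples.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyTextEncoder(nn.Module):
    """Stand-in text encoder: embeds token ids and mean-pools them."""
    def __init__(self, vocab_size=1000, dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
    def forward(self, token_ids):           # (C, L) -> (C, dim)
        return self.embed(token_ids).mean(dim=1)

class ToyImageEncoder(nn.Module):
    """Stand-in image encoder: flattens the image and projects it."""
    def __init__(self, dim=64):
        super().__init__()
        self.proj = nn.Linear(3 * 32 * 32, dim)
    def forward(self, images):               # (B, 3, 32, 32) -> (B, dim)
        return self.proj(images.flatten(1))

def zero_shot_logits(image_feats, text_feats, temperature=0.01):
    """Cosine similarity between image features and per-class prompt embeddings."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    return image_feats @ text_feats.t() / temperature

# Hypothetical classes; each would be turned into a prompt such as
# "a photo of a {class}" and tokenized (here: random token ids for illustration).
class_names = ["cat", "dog", "car"]
fake_tokens = torch.randint(0, 1000, (len(class_names), 8))   # (num_classes, seq_len)

text_enc, image_enc = ToyTextEncoder(), ToyImageEncoder()
text_feats = text_enc(fake_tokens)            # one embedding per class prompt
images = torch.randn(4, 3, 32, 32)
logits = zero_shot_logits(image_enc(images), text_feats)
print(logits.argmax(dim=-1))                  # predicted class index per image
```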

Cited by 60 publications (152 citation statements)
References 20 publications
“…However, the performance gap from full-model fine-tuning closes up as the pre-trained model gets larger [33,42]. Inspired by the success of prompt tuning in NLP, [77] applies prompt tuning to visual-linguistic pre-trained models (e.g., CLIP [55]) to perform few-shot image classification. [51,58] further apply a residual feature adapter to improve the few-shot performance.…”
Section: Related Work (mentioning)
confidence: 99%
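The prompt tuning referenced in this statement (CoOp-style) replaces hand-crafted prompt words with learnable context vectors that are optimized on a few labeled examples while both encoders stay frozen. A minimal sketch under those assumptions follows; the frozen encoders are toy stand-ins, not the actual CLIP model.

```python
# Minimal sketch of CoOp-style soft prompt tuning: learnable context vectors are
# prepended to each class-name embedding and are the only trained parameters;
# the text and image encoders stay frozen. All modules and names are stand-ins.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptLearner(nn.Module):
    def __init__(self, num_classes, ctx_len=4, dim=64):
        super().__init__()
        # Shared learnable context, initialized randomly ("[V]_1 ... [V]_M [CLASS]").
        self.ctx = nn.Parameter(torch.randn(ctx_len, dim) * 0.02)
        # Frozen class-name embeddings (stand-in for tokenized class names).
        self.register_buffer("class_embed", torch.randn(num_classes, 1, dim))

    def forward(self):
        ctx = self.ctx.unsqueeze(0).expand(self.class_embed.size(0), -1, -1)
        return torch.cat([ctx, self.class_embed], dim=1)   # (C, ctx_len + 1, dim)

def encode_text(prompt_embeds):   # stand-in frozen text encoder: mean pooling
    return prompt_embeds.mean(dim=1)

def encode_image(images):         # stand-in frozen image encoder
    return images.flatten(1)[:, :64]

prompt_learner = PromptLearner(num_classes=3)
optimizer = torch.optim.SGD(prompt_learner.parameters(), lr=0.01)

# One few-shot training step: only the context vectors receive gradients.
images = torch.randn(8, 3, 32, 32)
labels = torch.randint(0, 3, (8,))
text_feats = F.normalize(encode_text(prompt_learner()), dim=-1)
image_feats = F.normalize(encode_image(images), dim=-1)
loss = F.cross_entropy(image_feats @ text_feats.t() / 0.01, labels)
loss.backward()
optimizer.step()
```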
“…Continuous prompts. There is a line of work focused on tuning continuous prompts (Li and Liang, 2021; Lester et al., 2021; Zhong et al., 2021; Qin and Eisner, 2021; Zhou et al., 2021). A recurring theme in this line of work is that continuous prompts yield strong yet compact models compared to conventional fine-tuning approaches.…”
Section: Related Work (mentioning)
confidence: 99%
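Continuous prompt tuning in the NLP setting prepends a small number of learnable embeddings to the input of a frozen language model and trains only those embeddings. A minimal sketch of that idea, using a tiny randomly initialized Transformer as a stand-in for a pre-trained model:

```python
# Minimal sketch of continuous ("soft") prompt tuning: learnable prompt embeddings
# are prepended to the input embeddings of a frozen model, and only the prompt is
# updated. The small TransformerEncoder below stands in for a real pre-trained LM.
import torch
import torch.nn as nn

dim, vocab, prompt_len = 64, 1000, 8
token_embed = nn.Embedding(vocab, dim)
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True), num_layers=2
)
classifier = nn.Linear(dim, 2)

# Freeze everything except the soft prompt.
for module in (token_embed, backbone, classifier):
    for p in module.parameters():
        p.requires_grad_(False)
soft_prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)

# One training step on fake data.
input_ids = torch.randint(0, vocab, (4, 16))
labels = torch.randint(0, 2, (4,))
embeds = token_embed(input_ids)                                    # (4, 16, dim)
prompts = soft_prompt.unsqueeze(0).expand(embeds.size(0), -1, -1)  # (4, 8, dim)
hidden = backbone(torch.cat([prompts, embeds], dim=1))             # (4, 24, dim)
loss = nn.functional.cross_entropy(classifier(hidden.mean(dim=1)), labels)
loss.backward()
optimizer.step()
```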
“…Inspired by the success of pre-trained models [6], [4], [25] and SimVLM [37] use attention architectures to further improve performance on vision-language tasks. The recent breakthroughs in vision-language learning, particularly CLIP [31] and ALIGN [18], are driven by noisy large-scale datasets available on the Internet: 400 million image-text pairs for CLIP and 1.8 billion noisy image-text pairs for ALIGN. To fine-tune vision-language models on downstream tasks such as few-shot classification, CoOp [42] proposes to learn soft prompts represented by continuous context vectors as an alternative to hand-crafted prompts, while CLIP-Adapter adopts an additional bottleneck layer to learn new features and performs residual-style feature blending with the original pre-trained features. Though CoOp and CLIP-Adapter achieve significant performance from the perspectives of prompt learning and feature adapters, our VT-CLIP explores the impact of instance-level visual features on text features with a cross-attention module.…”
Section: Related Work (mentioning)
confidence: 99%
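The residual feature adapter described here for CLIP-Adapter is a small bottleneck MLP applied to the frozen features, whose output is blended back with the original features by a residual ratio. A minimal sketch under those assumptions; the dimensions and blending ratio are illustrative choices, not the paper's exact values.

```python
# Minimal sketch of a CLIP-Adapter-style residual feature adapter: a bottleneck
# MLP refines frozen image features and is blended with them by a residual ratio.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAdapter(nn.Module):
    def __init__(self, dim=512, reduction=4, alpha=0.2):
        super().__init__()
        self.alpha = alpha  # blending ratio between adapted and original features
        self.bottleneck = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim), nn.ReLU(inplace=True),
        )

    def forward(self, feats):
        adapted = self.bottleneck(feats)
        return self.alpha * adapted + (1 - self.alpha) * feats

# Frozen features from a pre-trained encoder (random stand-ins here).
image_feats = torch.randn(8, 512)
text_feats = F.normalize(torch.randn(3, 512), dim=-1)   # one per class prompt

adapter = FeatureAdapter()
logits = F.normalize(adapter(image_feats), dim=-1) @ text_feats.t() / 0.01
print(logits.shape)   # (8, 3): per-class scores after adaptation
```

Only the adapter's parameters would be trained on the few-shot data, with the pre-trained encoders kept frozen, which is what keeps this family of methods lightweight.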