2022
DOI: 10.48550/arxiv.2205.14865
Preprint

Prompt-aligned Gradient for Prompt Tuning

Abstract: Thanks to the large pre-trained vision-language models (VLMs) like CLIP [36], we can craft a zero-shot classifier by "prompt", e.g., the confidence score of an image being "[CLASS]" can be obtained by using the VLM provided similarity measure between the image and the prompt sentence "a photo of a [CLASS]". Therefore, prompt shows a great potential for fast adaptation of VLMs to downstream tasks if we fine-tune the prompt-based similarity measure. However, we find a common failure that improper fine-tuning may…
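
A minimal sketch of the prompt-based zero-shot classifier described in the abstract, assuming OpenAI's `clip` package (https://github.com/openai/CLIP); the class names and image path are illustrative placeholders, and the paper's own fine-tuning method (ProGrad) is not shown here.

```python
import clip
import torch
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["cat", "dog", "car"]                     # example classes (assumed)
prompts = [f"a photo of a {c}" for c in class_names]    # hand-crafted prompt template

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    # cosine similarity between the image and each prompt sentence
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_feat @ text_feat.T           # temperature-scaled similarities
    probs = logits.softmax(dim=-1)                      # confidence score per [CLASS]

print(dict(zip(class_names, probs[0].tolist())))
```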


Cited by 11 publications (23 citation statements)
References: 30 publications
“…However, using tokens referring to the domain identifiers in the prompts improves baseline CLIP's performance. Finally, StyLIP outperforms the previous best prompting techniques [8,9] substantially, highlighting the importance of style and content disentanglement in the prompts for DG tasks.…”
Section: Introduction
confidence: 91%
“…In this paper, we follow this research trend, but different from existing prompting methods [3][4][5][8], which evaluate the generalization capabilities of CLIP on datasets where the domain shift is limited (e.g. variants of ImageNet [17]), we study a more challenging setting where the visual appearance of images varies significantly across different domains.…”
Section: Introduction
confidence: 99%
“…For example, CLIP [81] adopts linear probing [12,31,32,109] and full-finetuning [25,31,48,99,101,109] when transferring to downstream tasks. Prompt adaptation of CLIP [63,81,105,112,114] is motivated by the success of prefix-tuning for language models [16,22,30,45,61,78,84,85,89]. Similarly, CLIP-Adapter [21] and Tip-Adapter [111] are inspired by parameter-efficient finetuning methods [39,44,110] that optimize lightweight MLPs while freezing the encoder.…”
Section: Related Work
confidence: 99%
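
The adapter-style recipe mentioned in this excerpt (a lightweight MLP trained on top of a frozen encoder) can be sketched roughly as follows; the bottleneck dimensions and the residual blending ratio are illustrative assumptions, not the exact CLIP-Adapter or Tip-Adapter configuration.

```python
import torch
import torch.nn as nn

class FeatureAdapter(nn.Module):
    """Illustrative adapter: a small bottleneck MLP applied to frozen CLIP
    features, blended with the original feature via a residual ratio."""
    def __init__(self, dim: int = 512, reduction: int = 4, ratio: float = 0.2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim // reduction, bias=False),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim, bias=False),
            nn.ReLU(inplace=True),
        )
        self.ratio = ratio

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        adapted = self.mlp(feat)
        # Mix the adapted feature with the frozen feature; only self.mlp is trained.
        return self.ratio * adapted + (1 - self.ratio) * feat

# Usage: the CLIP encoders stay frozen, only the adapter's parameters receive gradients.
adapter = FeatureAdapter(dim=512)
frozen_image_feat = torch.randn(8, 512)   # placeholder for encode_image output
tuned_feat = adapter(frozen_image_feat)
```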
“…In contrast to our cross-modal approach, most prior works simply follow the popular practice of finetuning uni-modal foundation models, such as large vision [12,31,32] or language models [8,17,62]. For example, CoOp [113] and other prompting methods [63,112,114] finetune CLIP via prefix tuning to replace hand-engineered prompts such as "a photo of a {cls}" with learned word tokens. Similarly, inspired by parameter-efficient tuning of language models [39], adapter-based methods [21,111] finetune CLIP by inserting lightweight multi-layer-perceptrons (MLPs).…”
Section: Introduction
confidence: 99%
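
The prefix-tuning idea described in this excerpt, replacing the hand-written template "a photo of a {cls}" with learned word tokens, can be sketched as follows; the context length, embedding size, and the way class-name embeddings are obtained are simplifying assumptions rather than CoOp's exact implementation.

```python
import torch
import torch.nn as nn

class LearnablePrompt(nn.Module):
    """Illustrative CoOp-style prompt: a few trainable context vectors are
    prepended to frozen class-name token embeddings in place of the
    hand-engineered template."""
    def __init__(self, class_embeds: torch.Tensor, n_ctx: int = 4):
        super().__init__()
        _, _, dim = class_embeds.shape
        self.register_buffer("class_embeds", class_embeds)       # frozen class-name embeddings
        self.ctx = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)  # learned "word" tokens

    def forward(self) -> torch.Tensor:
        n_cls = self.class_embeds.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)        # share context across classes
        # [CTX_1 ... CTX_n, CLASS] would then be fed to the frozen text encoder.
        return torch.cat([ctx, self.class_embeds], dim=1)

# Usage with placeholder class-name embeddings (e.g. from CLIP's token embedding layer):
class_embeds = torch.randn(10, 3, 512)    # 10 classes, 3 name tokens each, dim 512
prompt = LearnablePrompt(class_embeds, n_ctx=4)
prompt_embeddings = prompt()              # shape: (10, 4 + 3, 512); only `ctx` is trained
```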