Visual Prompt Tuning

Jia, Menglin; Tang, Luming; Chen, Bor-Chun; Cardie, Claire; Belongie, Serge; Hariharan, Bharath; Lim, Ser-Nam

doi:10.48550/arxiv.2203.12119

Cited by 25 publications

(40 citation statements)

References 0 publications

“…Rare attention has been drawn to the field of efficient adaptation, especially in the field of vision Transformers. Inspired by Prompting in NLP, [45] introduced the learnable tokens in exploring the efficient adaptation for ViTs. We empirically found that the performance of prompting is hindered by the scale of tokens.…”

Section: Efficient Transfer Learning For Transformers (mentioning)

confidence: 99%

“…Compared to our methods, we notice that recent prompt-related approaches insert trainable parameters into the token space, as illustrated in Figure 3. They prepend learnable parameters either into the embedded tokens before linear projection [52] or the key and value tokens after linear projection [45]. Therefore, the prompt-related method can not be straightforwardly adapted to special MHSA variants, especially for the one that takes the pyramid spatial information into account [56,73].…”

Section: Multi-head Attention (mentioning)

confidence: 99%

“…On the more challenging video action recognition dataset Something-Something V2, the superiority becomes even more significant, i.e., about 34.96%. Note that even compared with the full fine-tuning Figure 5: Test accuracy of VPT [45] with different number of introduced tokens. The optimization procedure becomes unstable when the token number is equal or larger than eight on HMDB51 dataset [49].…”

Section: Main Properties and Analysis (mentioning)

confidence: 99%

“…More recently, Bahng et.al., [4] aimed to adapt pre-trained models by modifying raw input pixel space. Jia et.al., [45] proposed Visual Prompt Tuning (VPT) to adapt transformer models for downstream vision tasks, which prepends several learnable parameters (prompts) to the patch embeddings and freezes the whole pre-trained backbone.…”

Section: Introduction (mentioning)

confidence: 99%
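
The mechanism described in the statement above (learnable prompts prepended to the patch embeddings while the backbone stays frozen) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the initialization range, and the parameter counts used below are hypothetical, and plain Python lists stand in for tensors.

```python
import random


def make_prompts(num_prompts, dim, seed=0):
    """Create hypothetical learnable prompt vectors with small random values."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(num_prompts)]


def prepend_prompts(patch_embeddings, prompts):
    """Return the token sequence with the prompts placed before the patch embeddings."""
    dim = len(patch_embeddings[0])
    if any(len(p) != dim for p in prompts):
        raise ValueError("prompts must share the embedding dimension")
    # Copy each vector so the caller's lists are not aliased.
    return [list(p) for p in prompts] + [list(e) for e in patch_embeddings]


def trainable_fraction(num_prompts, dim, backbone_params):
    """Share of all parameters that is trained when only the prompts are updated."""
    prompt_params = num_prompts * dim
    return prompt_params / (prompt_params + backbone_params)


# Three patch embeddings of dimension 4, preceded by two prompts.
patches = [[0.0] * 4 for _ in range(3)]
prompts = make_prompts(num_prompts=2, dim=4)
tokens = prepend_prompts(patches, prompts)
```

In this sketch only the entries of `prompts` would receive gradient updates; the frozen backbone would process `tokens` unchanged.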

AdaptFormer: Adapting Vision Transformers for Scalable Visual Recognition

Shoufa, Ge, Zhan et al. 2022

Preprint

Although the pre-trained Vision Transformers (ViTs) achieved great success in computer vision, adapting a ViT to various image and video tasks is challenging because of its heavy computation and storage burdens, where each model needs to be independently and comprehensively fine-tuned to different tasks, limiting its transferability in different domains. To address this challenge, we propose an effective adaptation approach for Transformer, namely AdaptFormer, which can adapt the pre-trained ViTs into many different image and video tasks efficiently. It possesses several benefits more appealing than prior arts. Firstly, AdaptFormer introduces lightweight modules that only add less than 2% extra parameters to a ViT, while it is able to increase the ViT's transferability without updating its original pre-trained parameters, significantly outperforming the existing 100% fully fine-tuned models on action recognition benchmarks. Secondly, it can be plug-and-play in different Transformers and scalable to many visual tasks. Thirdly, extensive experiments on five image and video datasets show that AdaptFormer largely improves ViTs in the target domains. For example, when updating just 1.5% extra parameters, it achieves about 10% and 19% relative improvement compared to the fully fine-tuned models on Something-Something v2 and HMDB51, respectively.

“…Prompt tuning and other PEFT methods have also been explored outside of the context of language models (e.g. vision [22,69] and vision-and-language models [26]).…”

Section: Related Work (mentioning)

confidence: 99%

Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning

Liu, Tam, Muqeeth et al. 2022

Preprint

Few-shot in-context learning (ICL) enables pre-trained language models to perform a previously-unseen task without any gradient-based training by feeding a small number of training examples as part of the input. ICL incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made. Parameter-efficient fine-tuning (e.g. adapter modules, prompt tuning, sparse update methods, etc.) offers an alternative paradigm where a small set of parameters are trained to enable a model to perform the new task. In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs. Along the way, we introduce a new parameter-efficient fine-tuning method called (IA)³ that scales activations by learned vectors, attaining stronger performance while only introducing a relatively tiny amount of new parameters. We also propose a simple recipe based on the T0 model [1] called T-Few that can be applied to new tasks without task-specific tuning or modifications. We validate the effectiveness of T-Few on completely unseen tasks by applying it to the RAFT benchmark [2], attaining super-human performance for the first time and outperforming the state-of-the-art by 6% absolute. All of the code used in our experiments is publicly available at https://github.com/r-three/t-few.
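
The abstract's description of scaling activations by learned vectors can be sketched as an element-wise product. This is an illustrative toy, not the released T-Few code: the function name is hypothetical, plain lists stand in for tensors, and which activations get rescaled in the real method is not stated in the abstract.

```python
def scale_activations(activations, learned_vector):
    """Rescale every activation vector element-wise by one learned vector."""
    if any(len(a) != len(learned_vector) for a in activations):
        raise ValueError("learned_vector must match the activation dimension")
    return [[x * s for x, s in zip(a, learned_vector)] for a in activations]


# A vector of ones leaves the activations unchanged, so using it as the
# starting point (an assumption here) would preserve the pre-trained behavior.
acts = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
ones = [1.0, 1.0, 1.0]
unchanged = scale_activations(acts, ones)
```

Only the entries of the learned vector would be trained, which is why the added parameter count stays small relative to the model.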
