2021
DOI: 10.48550/arxiv.2112.06825
Preprint
VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks

Abstract: Recently, fine-tuning language models pre-trained on large text corpora has provided huge improvements on vision-and-language (V&L) tasks as well as on pure language tasks. However, fine-tuning the entire parameter set of pre-trained models becomes impractical since model sizes are growing rapidly. Hence, in this paper, we introduce adapter-based parameter-efficient transfer learning techniques to V&L models such as VL-BART and VL-T5. We evaluate our methods in a unified multi-task setup on four diverse V&L…
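The abstract does not include implementation details; as a minimal illustrative sketch (not the paper's actual code), the block below shows the bottleneck adapter pattern this line of work typically inserts into each transformer layer: a trainable down-projection, non-linearity, up-projection, and residual connection, while the pre-trained backbone stays frozen. It is written in PyTorch; the class name, hidden size, and reduction factor are assumptions for illustration.

    import torch
    import torch.nn as nn

    class BottleneckAdapter(nn.Module):
        """Illustrative bottleneck adapter: down-project, non-linearity,
        up-project, plus a residual connection. Only these small matrices
        are trained; the pre-trained transformer weights remain frozen."""
        def __init__(self, hidden_dim: int, reduction_factor: int = 8):
            super().__init__()
            bottleneck_dim = hidden_dim // reduction_factor  # assumed reduction
            self.down = nn.Linear(hidden_dim, bottleneck_dim)
            self.act = nn.ReLU()
            self.up = nn.Linear(bottleneck_dim, hidden_dim)

        def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
            return hidden_states + self.up(self.act(self.down(hidden_states)))

    # Example: adapting a 768-dim transformer hidden state (dimensions are placeholders)
    adapter = BottleneckAdapter(hidden_dim=768)
    x = torch.randn(2, 16, 768)   # (batch, sequence, hidden)
    out = adapter(x)              # same shape, residual-added output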

Cited by 5 publications (6 citation statements)
References 17 publications
“…In the field of computer vision, some adapters have been proposed for incremental learning [64] and domain adaptation [62,63]. With the advent of CLIP [60], many CLIP-based adapters [24,69,81] were presented to transfer pre-trained knowledge to zero-shot or few-shot downstream tasks. Recently, [43] employed upsampling and downsampling modules to adapt the single-scale ViT to the multi-scale FPN [48].…”
Section: Related Work
confidence: 99%
“…CoOp [87] and CoCoOp [88] apply prefix tuning for adapting the CLIP model to various image recognition tasks. VL-Adapter [68] achieves performance comparable to full fine-tuning on challenging vision-language tasks. Commonly, their design focus is restricted to the text encoder of the CLIP model.…”
Section: Related Work
confidence: 99%
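For context on the prefix-tuning idea this excerpt refers to, the following is a rough sketch of CoOp-style prompt tuning: a few learnable context vectors are prepended to the frozen token embeddings fed to the text encoder, and only those vectors are optimized. The class name, embedding size, and number of context tokens are illustrative assumptions, not the cited papers' code.

    import torch
    import torch.nn as nn

    class LearnablePromptPrefix(nn.Module):
        """Prepend a few trainable context vectors to frozen token embeddings,
        in the spirit of CoOp-style prompt tuning; only the prefix is trained."""
        def __init__(self, num_context: int = 4, embed_dim: int = 512):
            super().__init__()
            self.context = nn.Parameter(torch.randn(num_context, embed_dim) * 0.02)

        def forward(self, token_embeddings: torch.Tensor) -> torch.Tensor:
            # token_embeddings: (batch, seq_len, embed_dim) from a frozen embedding layer
            batch = token_embeddings.size(0)
            prefix = self.context.unsqueeze(0).expand(batch, -1, -1)
            return torch.cat([prefix, token_embeddings], dim=1)

    prompt = LearnablePromptPrefix()
    class_name_emb = torch.randn(8, 6, 512)  # placeholder embeddings of a class-name template
    prompted = prompt(class_name_emb)        # shape (8, 4 + 6, 512)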
“…Specifically, one option is to train only a subset of the model parameters, most commonly a linear probe on top of pre-trained features [12]. The alternative is to insert new parameters within the network [15,14,6,7,37,38]. Nevertheless, two problems arise when adopting these methods for fine-tuning Vision Transformers.…”
Section: Efficient Fine-tuning in NLP
confidence: 99%
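As a minimal sketch of the first option described in this excerpt (a linear probe on top of frozen pre-trained features), assuming a PyTorch setup in which the backbone, feature dimension, and class count are placeholders rather than anything from the cited work:

    import torch
    import torch.nn as nn

    # Placeholder standing in for any pre-trained feature extractor.
    backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 512))
    for p in backbone.parameters():
        p.requires_grad = False          # freeze all pre-trained weights

    probe = nn.Linear(512, 10)           # the only trainable module (the linear probe)
    optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

    images = torch.randn(4, 3, 224, 224)
    labels = torch.randint(0, 10, (4,))

    with torch.no_grad():                # features come from the frozen network
        feats = backbone(images)
    loss = nn.functional.cross_entropy(probe(feats), labels)
    loss.backward()
    optimizer.step()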