VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks

Sung, Yi-Lin; Cho, Jaemin; Bansal, Mohit

doi:10.48550/arxiv.2112.06825

Cited by 5 publications

(6 citation statements)

References 17 publications

“…In the field of computer vision, some adapters have been proposed for incremental learning [64] and domain adaptation [62,63]. With the advent of CLIP [60], many CLIP-based adapters [24,69,81] were presented to transfer pre-trained knowledge to zero-shot or few-shot downstream tasks. Recently, [43] employed some upsampling and downsampling modules to adapt the single-scale ViT to the multi-scale FPN [48].…”

Section: Related Work (mentioning)

confidence: 99%

Vision Transformer Adapter for Dense Predictions

Chen, Duan, Wang et al. 2022

Preprint

This work investigates a simple yet powerful adapter for Vision Transformer (ViT). Unlike recent visual transformers that introduce vision-specific inductive biases into their architectures, ViT achieves inferior performance on dense prediction tasks due to lacking prior information of images. To solve this issue, we propose a Vision Transformer Adapter (ViT-Adapter), which can remedy the defects of ViT and achieve comparable performance to vision-specific models by introducing inductive biases via an additional architecture. Specifically, the backbone in our framework is a vanilla transformer that can be pre-trained with multi-modal data. When fine-tuning on downstream tasks, a modality-specific adapter is used to introduce the data and tasks' prior information into the model, making it suitable for these tasks. We verify the effectiveness of our ViT-Adapter on multiple downstream tasks, including object detection, instance segmentation, and semantic segmentation. Notably, when using HTC++, our ViT-Adapter-L yields 60.1 AP^b and 52.1 AP^m on COCO test-dev, surpassing Swin-L by 1.4 AP^b and 1.0 AP^m. For semantic segmentation, our ViT-Adapter-L establishes a new state-of-the-art of 60.5 mIoU on ADE20K val, 0.6 points higher than SwinV2-G. We hope that the proposed ViT-Adapter could serve as an alternative for vision-specific transformers and facilitate future research. The code and models will be released at https://github.com/czczup/ViT-Adapter.

“…CoOp [87] and CoCoOp [88] apply prefix tuning for adapting the CLIP model to various image recognition tasks. VL-Adapter [68] achieves the performance comparable to full fine-tuning on challenging vision-language tasks. Commonly, their design focuses are all restricted to the text encoder of the CLIP model.…”

Section: Related Work (mentioning)

confidence: 99%

ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning for Action Recognition

Pan, Lin, Zhu et al. 2022

Preprint

Capitalizing on large pre-trained models for various downstream tasks of interest have recently emerged with promising performance. Due to the ever-growing model size, the standard full fine-tuning based task adaptation strategy becomes prohibitively costly in terms of model training and storage. This has led to a new research direction in parameter-efficient transfer learning. However, existing attempts typically focus on downstream tasks from the same modality (e.g., image understanding) of the pre-trained model. This creates a limit because in some specific modalities, (e.g., video understanding) such a strong pre-trained model with sufficient knowledge is less or not available. In this work, we investigate such a novel cross-modality transfer learning setting, namely parameter-efficient image-to-video transfer learning. To solve this problem, we propose a new Spatio-Temporal Adapter (ST-Adapter) for parameter-efficient fine-tuning per video task. With a built-in spatio-temporal reasoning capability in a compact design, ST-Adapter enables a pre-trained image model without temporal knowledge to reason about dynamic video content at a small (∼8%) per-task parameter cost, requiring approximately 20 times fewer updated parameters compared to previous work. Extensive experiments on video action recognition tasks show that our ST-Adapter can match or even outperform the strong full fine-tuning strategy and state-of-the-art video models, whilst enjoying the advantage of parameter efficiency.

“…Specifically, one is to train a subset of the model parameters, where the most common approach is to use a linear probe on top of pretrained features [12]. The other alternative method surfaces by including new parameters in between the network [15,14,6,7,37,38]. Nevertheless, two problems arise when adopting these methods for fine-tuning Vision Transformers.…”

Section: Efficient Fine-tuning in NLP (mentioning)

confidence: 99%

Parameter-efficient Model Adaptation for Vision Transformers

He, Li, Zhang et al. 2022

Preprint

In computer vision, it has achieved great success in adapting large-scale pretrained vision models (e.g., Vision Transformer) to downstream tasks via fine-tuning. Common approaches for fine-tuning either update all model parameters or leverage linear probes. In this paper, we aim to study parameter-efficient fine-tuning strategies for Vision Transformers on vision tasks. We formulate efficient fine-tuning as a subspace training problem and perform a comprehensive benchmarking over different efficient fine-tuning methods. We conduct an empirical study on each efficient fine-tuning method focusing on its performance alongside parameter cost. Furthermore, we also propose a parameter-efficient fine-tuning framework, which first selects submodules by measuring local intrinsic dimensions and then projects them into subspace for further decomposition via a novel Kronecker Adaptation method. We analyze and compare our method with a diverse set of baseline fine-tuning methods (including state-of-the-art methods for pretrained language models). Our method performs the best in terms of the tradeoff between accuracy and parameter efficiency across three commonly used image classification datasets.
