AdapterDrop: On the Efficiency of Adapters in Transformers

Rücklé, Andreas; Geigle, Gregor; Glockner, Max; Beck, T.; Pfeiffer, Jonas; Reimers, Nils; Gurevych, Iryna

doi:10.48550/arxiv.2010.11918

Cited by 10 publications

(16 citation statements)

References 0 publications


“…More on Fine-tuning: There exist other parameter-efficient tuning methods which we did not evaluate in our work. Some of these include random subspace projection (exploiting intrinsic dimensionality; Aghajanyan et al., 2020), prefix and prompt tuning (Lester et al., 2021), tuning only biases (Cai et al., 2020; Ben Zaken et al., 2021), and other architecture variants including Adapters (Pfeiffer et al., 2021; Rücklé et al., 2020). An interesting direction for future work is to see whether parameter-efficient tuning approaches specifically designed for the private setting can achieve higher utility.…”

Section: Related Work (mentioning)

confidence: 99%

Differentially Private Fine-tuning of Language Models

Yu, Naik, Bačkurs, et al. 2021

Preprint


We give simpler, sparser, and faster algorithms for differentially private fine-tuning of large-scale pre-trained language models, which achieve state-of-the-art privacy versus utility tradeoffs on many standard NLP tasks. We propose a meta-framework for this problem, inspired by the recent success of highly parameter-efficient methods for fine-tuning. Our experiments show that differentially private adaptations of these approaches outperform previous private algorithms in three important dimensions: utility, privacy, and the computational and memory cost of private training. On many commonly studied datasets, the utility of private models approaches that of non-private models. For example, on the MNLI dataset we achieve an accuracy of 87.8% using RoBERTa-Large and 83.5% using RoBERTa-Base with a privacy budget of ε = 6.7. In comparison, absent privacy constraints, RoBERTa-Large achieves an accuracy of 90.2%. Our findings are similar for natural language generation tasks. When privately fine-tuned on DART, GPT-2-Small, GPT-2-Medium, GPT-2-Large, and GPT-2-XL achieve BLEU scores of 38.5, 42.0, 43.1, and 43.8, respectively (privacy budget of ε = 6.8, δ = 1e-5), whereas the non-private baseline is 48.1. All our experiments suggest that larger models are better suited for private fine-tuning: while they are well known to achieve superior accuracy non-privately, we find that they also better maintain their accuracy when privacy is introduced.


“…Specifically, one is to train a subset of the model parameters, where the most common approach is to use a linear probe on top of pretrained features [12]. The other alternative method surfaces by including new parameters in between the network [15,14,6,7,37,38]. Nevertheless, two problems arise when adopting these methods for fine-tuning Vision Transformers.…”

Section: Efficient Fine-tuning in NLP (mentioning)

confidence: 99%

“…• AdapterDrop [14]: AdapterDrop is an extension of Adapter-tuning methods that drops Adapters from lower Transformer layers during training and inference. In our experiments, we dropped Adapters from all layers except for the last layer in ViT.…”

Section: Baselines (mentioning)

confidence: 99%
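The layer-dropping idea in the statement above can be illustrated with a minimal sketch. It is not taken from the AdapterDrop paper or from any library: `Layer`, `drop_adapters`, `run`, and the toy stack are hypothetical names, the "layers" are plain functions over lists of floats, and the residual form `h + adapter(h)` is an assumption about how the adapter is wired in.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

Vector = List[float]


@dataclass
class Layer:
    """Hypothetical stand-in for one Transformer layer with an optional adapter."""
    base: Callable[[Vector], Vector]               # frozen pretrained computation
    adapter: Optional[Callable[[Vector], Vector]]  # small trainable module, or None

    def forward(self, x: Vector) -> Vector:
        h = self.base(x)
        if self.adapter is None:
            return h  # adapter dropped: only the frozen layer runs
        # Assumed residual wiring: output = h + adapter(h).
        return [hi + ai for hi, ai in zip(h, self.adapter(h))]


def drop_adapters(layers: List[Layer], n_dropped: int) -> List[Layer]:
    """Return a copy of the stack with adapters removed from the first n_dropped layers."""
    return [
        Layer(layer.base, None if i < n_dropped else layer.adapter)
        for i, layer in enumerate(layers)
    ]


def run(layers: List[Layer], x: Vector) -> Vector:
    """Apply the layers in order, lowest layer first."""
    for layer in layers:
        x = layer.forward(x)
    return x


# Toy 4-layer stack: identity base layers, adapters that return 0.5 * h.
toy_stack = [
    Layer(base=lambda v: list(v), adapter=lambda v: [0.5 * t for t in v])
    for _ in range(4)
]

# Keep only the last layer's adapter, mirroring the quoted baseline setup.
last_only = drop_adapters(toy_stack, n_dropped=len(toy_stack) - 1)
```

Because the lower layers no longer carry adapters, their outputs depend only on frozen weights, which is where the efficiency gain would come from in a real model.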

“…Specifically, we experiment with two types of Vision Transformers in the remainder of this paper: the one via Contrastive Language-Image Pretraining (also known as CLIP) [12], and the one via supervised pretraining (we refer to as Supervised ViT) [13]. In addition to Full-model Fine-tuning and linear probing, we re-implement several SOTA efficient fine-tuning methods [6,14,7,8,15] (originally proposed for pretrained language models) on vision tasks, and propose various baseline methods for comparison.…”

Section: Introduction (mentioning)

confidence: 99%


Parameter-efficient Model Adaptation for Vision Transformers

He, Li, Zhang, et al. 2022

Preprint


In computer vision, great success has been achieved in adapting large-scale pretrained vision models (e.g., Vision Transformer) to downstream tasks via fine-tuning. Common approaches for fine-tuning either update all model parameters or leverage linear probes. In this paper, we aim to study parameter-efficient fine-tuning strategies for Vision Transformers on vision tasks. We formulate efficient fine-tuning as a subspace training problem and perform a comprehensive benchmarking over different efficient fine-tuning methods. We conduct an empirical study on each efficient fine-tuning method focusing on its performance alongside parameter cost. Furthermore, we also propose a parameter-efficient fine-tuning framework, which first selects submodules by measuring local intrinsic dimensions and then projects them into a subspace for further decomposition via a novel Kronecker Adaptation method. We analyze and compare our method with a diverse set of baseline fine-tuning methods (including state-of-the-art methods for pretrained language models). Our method performs the best in terms of the tradeoff between accuracy and parameter efficiency across three commonly used image classification datasets.


“…The rationales behind adapters: Why are adapters able to achieve comparable accuracy with far fewer parameters than freezing the bottom transformer layers without revising the model structure? We explain it with two insights from our experiments and related literature [54,63].…”

Section: Design 31 Plugable Adapters (mentioning)

confidence: 99%

AutoFedNLP: An efficient FedNLP framework

Cai, Wu, Wang, et al. 2022

Preprint


Transformer-based pre-trained models have revolutionized NLP for superior performance and generality. Fine-tuning pre-trained models for downstream tasks often requires private data, for which federated learning is the de-facto approach (i.e., FedNLP). However, our measurements show that FedNLP is prohibitively slow due to the large model sizes and the resultant high network/computation cost. Towards practical FedNLP, we identify adapters, small bottleneck modules inserted at a variety of model layers, as the key building blocks. A key challenge is to properly configure the depth and width of adapters, to which the training speed and efficiency are highly sensitive. No silver-bullet configuration exists: the optimal choice varies across downstream NLP tasks, desired model accuracy, and client resources. To automate adapter configuration, we propose AutoFedNLP, a framework that enhances the existing FedNLP with two novel designs. First, AutoFedNLP progressively upgrades the adapter configuration throughout a training session; the principle is to quickly learn shallow knowledge by training only fewer and smaller adapters at the model's top layers, and incrementally learn deep knowledge by incorporating deeper and larger adapters. Second, AutoFedNLP continuously profiles future adapter configurations by allocating participant devices to trial groups. To minimize client-side computations, AutoFedNLP exploits the fact that a FedNLP client trains on the same samples repeatedly between consecutive changes of adapter configurations, and caches computed activations on clients. Extensive experiments show that AutoFedNLP can reduce FedNLP's model convergence delay to no more than several hours, which is up to 155.5× faster compared to vanilla FedNLP and 48× faster compared to strong baselines.
