2021
DOI: 10.48550/arxiv.2110.07577
Preprint

UniPELT: A Unified Framework for Parameter-Efficient Language Model Tuning

Abstract: Conventional fine-tuning of pre-trained language models tunes all model parameters and stores a full model copy for each downstream task, which has become increasingly infeasible as the model size grows larger. Recent parameter-efficient language model tuning (PELT) methods manage to match the performance of fine-tuning with much fewer trainable parameters and perform especially well when the training data is limited. However, different PELT methods may perform rather differently on the same task, making it no…
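To make the contrast in the abstract concrete, here is a minimal PyTorch sketch (not the authors' code) of the difference between conventional fine-tuning, where every backbone parameter is trainable and a full model copy is stored per task, and parameter-efficient tuning, where the pre-trained backbone is frozen and only a small added module is updated. The toy encoder and the linear task head are hypothetical placeholders chosen for illustration.

```python
import torch.nn as nn

hidden_size, num_layers, num_labels = 768, 12, 2

# Toy stand-in for a pre-trained language model encoder; in practice this would
# be a checkpoint such as BERT-base loaded from disk (hypothetical placeholder).
backbone = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden_size, nhead=12, batch_first=True),
    num_layers=num_layers,
)

# Conventional fine-tuning: every backbone parameter is trainable, so a full
# model copy must be stored for each downstream task.
full_ft_params = sum(p.numel() for p in backbone.parameters())

# Parameter-efficient tuning: freeze the backbone and train only a small
# task-specific module (a linear head here, purely as a placeholder).
for p in backbone.parameters():
    p.requires_grad = False
task_head = nn.Linear(hidden_size, num_labels)
pelt_params = sum(p.numel() for p in task_head.parameters() if p.requires_grad)

print(f"full fine-tuning:    {full_ft_params:,} trainable parameters")
print(f"parameter-efficient: {pelt_params:,} trainable parameters")
```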

Cited by 6 publications (8 citation statements)
References 11 publications
“…We implement our framework in PyTorch and use Tesla V100 GPUs for experiments. AdaMix uses adapter dimension sizes of 16 and 48 with BERT-base and RoBERTa-large encoders, respectively, following the setup of existing works Hu et al. (2021); Mao et al. (2021) for a fair comparison. The number of adapters in AdaMix is set to 4 for all the tasks and encoders unless otherwise specified.…”
Section: Methods
Citation type: mentioning (confidence: 99%)
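The quoted setup specifies only the adapter bottleneck dimension (16 for BERT-base, 48 for RoBERTa-large) and the number of adapters (4). As a rough illustration, here is a generic bottleneck-adapter module in PyTorch using the BERT-base setting; it is a sketch under those assumptions, does not reproduce AdaMix's mixture-of-adapters routing, and all names are illustrative.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Generic down-project / nonlinearity / up-project adapter with a residual
    connection. Dimensions follow the quoted setup (hidden size 768, bottleneck
    16 for BERT-base); the routing over 4 adapter copies used by AdaMix is omitted."""
    def __init__(self, hidden_size: int = 768, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Residual connection keeps the frozen representation intact.
        return hidden_states + self.up(self.act(self.down(hidden_states)))

adapter = BottleneckAdapter()
x = torch.randn(2, 128, 768)            # (batch, sequence, hidden)
print(adapter(x).shape)                 # torch.Size([2, 128, 768])
print(sum(p.numel() for p in adapter.parameters()), "trainable parameters")
```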
“…The best result on each task is in bold and "-" denotes the missing measure. † and denote that the reported results are taken from Mao et al. (2021); Zaken et al. (2021). The average performance is calculated based on F1 of QQP and MRPC.…”
Citation type: mentioning (confidence: 99%)
“…Although each of these three approaches has its own focus, the central idea is to keep the pre-trained parameters constant while training lightweight alternatives to achieve adaptation for downstream tasks. There have also been some recent attempts to grasp the internal connection of these strategies and build a unified parameter-efficient tuning framework [333, 334].…”
Section: Parameter-efficient Tuning
Citation type: mentioning (confidence: 99%)
“…Recently, prompt tuning has been proposed, which freezes big models and tunes only task-specific prompts for downstream tasks [328, 571, 333]. Building on prompt tuning, we can update and correct outdated knowledge during continual learning.…”
Section: Continual Learning
Citation type: mentioning (confidence: 99%)
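A minimal sketch of the prompt-tuning idea described above, assuming a frozen encoder and a short sequence of learnable soft-prompt vectors prepended to the token embeddings; the prompt length, dimensions, and module names are illustrative, not taken from the cited works.

```python
import torch
import torch.nn as nn

hidden_size, prompt_length, vocab_size = 768, 20, 30522

# Frozen components of a hypothetical pre-trained model.
embeddings = nn.Embedding(vocab_size, hidden_size)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=hidden_size, nhead=12, batch_first=True),
    num_layers=12,
)
for module in (embeddings, encoder):
    for p in module.parameters():
        p.requires_grad = False

# The only trainable parameters: a task-specific sequence of soft prompts.
soft_prompt = nn.Parameter(torch.randn(prompt_length, hidden_size) * 0.02)

def forward(input_ids: torch.Tensor) -> torch.Tensor:
    """Prepend the learned prompt vectors to the token embeddings."""
    token_embeds = embeddings(input_ids)                        # (B, L, H)
    prompts = soft_prompt.unsqueeze(0).expand(input_ids.size(0), -1, -1)
    return encoder(torch.cat([prompts, token_embeds], dim=1))   # (B, P+L, H)

out = forward(torch.randint(0, vocab_size, (2, 16)))
print(out.shape)  # torch.Size([2, 36, 768])
```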
“…Moreover, as the ratio of model parameters to labeled data increases, the fine-tuning process becomes more prone to overfitting (Karimi Mahabadi et al., 2021). There are two categories of solutions: first, model compression (Jafari et al., 2021; Chen et al., 2021); second, parameter-efficient tuning (PET) (Houlsby et al., 2019a; Karimi Mahabadi et al., 2021; Mao et al., 2021).…”
Section: Introduction
Citation type: mentioning (confidence: 99%)