Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping

Dodge, Jesse; Ilharco, Gabriel; Schwartz, Roy; Farhadi, Ali; Hajishirzi, Hannaneh; Smith, Noah A.

doi:10.48550/arxiv.2002.06305

Cited by 134 publications

(194 citation statements)

References 0 publications

Supporting

Mentioning

186

Contrasting

“…Quantization is always applied to the median checkpoint for the respective task. We exclude the problematic WNLI task (Levesque et al, 2012), as it has relatively small dataset and shows an unstable behaviour (Dodge et al, 2020), in particular due to several issues with the way the dataset was constructed.…”

Section: B Experimental Details B1 Fp32 Fine-tuning Details (mentioning)

confidence: 99%

Understanding and Overcoming the Challenges of Efficient Transformer Quantization

Yelysei, Nagel, Blankevoort. 2021. Preprint.

Transformer-based architectures have become the de-facto standard models for a wide range of Natural Language Processing tasks. However, their memory footprint and high latency are prohibitive for efficient deployment and inference on resource-limited devices. In this work, we explore quantization for transformers. We show that transformers have unique quantization challenges, namely high dynamic activation ranges that are difficult to represent with a low bit fixed-point format. We establish that these activations contain structured outliers in the residual connections that encourage specific attention patterns, such as attending to the special separator token. To combat these challenges, we present three solutions based on post-training quantization and quantization-aware training, each with a different set of compromises for accuracy, model size, and ease of use. In particular, we introduce a novel quantization scheme: per-embedding-group quantization. We demonstrate the effectiveness of our methods on the GLUE benchmark using BERT, establishing state-of-the-art results for post-training quantization. Finally, we show that transformer weights and embeddings can be quantized to ultra-low bit-widths, leading to significant memory savings with a minimum accuracy loss. Our source code is available at https://github.com/qualcomm-ai-research/transformer-quantization.

“…(2) RefCOCO+ (Yu et al, 2016) (3) Robust evaluation. Previous works show that model training on limited data can suffer from instability (Dodge et al, 2020;. For a robust and comprehensive evaluation, we report mean results from 5 random training and validation set split, as well as the standard deviation.…”

Section: Experimental Settings (mentioning)

confidence: 99%

CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models

Yao, Zhang, Zhang, et al. 2021. Preprint.

Pre-Trained Vision-Language Models (VL-PTMs) have shown promising capabilities in grounding natural language in image data, facilitating a broad variety of cross-modal tasks. However, we note that there exists a significant gap between the objective forms of model pre-training and fine-tuning, resulting in a need for quantities of labeled data to stimulate the visual grounding capability of VL-PTMs for downstream tasks. To address the challenge, we present Cross-modal Prompt Tuning (CPT, alternatively, Colorful Prompt Tuning), a novel paradigm for tuning VL-PTMs, which reformulates visual grounding into a fill-in-the-blank problem with color-based co-referential markers in image and text, maximally mitigating the gap. In this way, our prompt tuning approach enables strong few-shot and even zero-shot visual grounding capabilities of VL-PTMs. Comprehensive experimental results show that prompt tuned VL-PTMs outperform their fine-tuned counterparts by a large margin (e.g., 17.3% absolute accuracy improvement, and 73.8% relative standard deviation reduction on average with one shot in RefCOCO evaluation). All the data and code will be available to facilitate future research.

“…Environmental Impact The use of large-scale Transformers requires a lot of computations and GPUs/TPUs for training, which contributes to global warming (Strubell et al, 2019;Schwartz et al, 2020). This is a smaller issue in our case, as we do not train such models from scratch; rather, we fine-tune them on relatively small datasets.…”

Section: Ethics and Broader Impact (mentioning)

confidence: 99%

“…Pre-trained transformers often suffer from instability of the results across multiple reruns with different random seeds. This usually happens with small training datasets (Dodge et al, 2020;Mosbach et al, 2021). In such cases, typically multiple reruns are performed, and the average value over these reruns is reported.…”

Section: E Impact Of the Random Seed (mentioning)

confidence: 99%

RuleBert: Teaching Soft Rules to Pre-trained Language Models

Saeed, Ahmadi, Nakov, et al. 2021. Preprint.

While pre-trained language models (PLMs) are the go-to solution to tackle many natural language processing problems, they are still very limited in their ability to capture and to use common-sense knowledge. In fact, even if information is available in the form of approximate (soft) logical rules, it is not clear how to transfer it to a PLM in order to improve its performance for deductive reasoning tasks. Here, we aim to bridge this gap by teaching PLMs how to reason with soft Horn rules. We introduce a classification task where, given facts and soft rules, the PLM should return a prediction with a probability for a given hypothesis. We release the first dataset for this task, and we propose a revised loss function that enables the PLM to learn how to predict precise probabilities for the task. Our evaluation results show that the resulting fine-tuned models achieve very high performance, even on logical rules that were unseen at training. Moreover, we demonstrate that logical notions expressed by the rules are transferred to the finetuned model, yielding state-of-the-art results on external datasets.
