2019
DOI: 10.48550/arxiv.1911.03090
Preprint

What Would Elsa Do? Freezing Layers During Transformer Fine-Tuning

Abstract: Pretrained transformer-based language models have achieved state-of-the-art results across countless tasks in natural language processing. These models are highly expressive, comprising at least a hundred million parameters and a dozen layers. Recent evidence suggests that only a few of the final layers need to be fine-tuned for high quality on downstream tasks. Naturally, a subsequent research question is, "how many of the last layers do we need to fine-tune?" In this paper, we precisely answer this question. We exam…
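
As a rough illustration of the setup the abstract describes (fine-tuning only the last few layers of a pretrained transformer), the sketch below freezes everything except the last k encoder layers of a Hugging Face BERT classifier. The model name, the value k = 2, and the optimizer settings are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch: freeze all but the last k encoder layers before fine-tuning.
# "bert-base-uncased", k = 2, and the learning rate are illustrative
# assumptions, not the paper's reported setup.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

k = 2  # number of final encoder layers left trainable

# Freeze the embeddings and every encoder layer below the last k.
for param in model.bert.embeddings.parameters():
    param.requires_grad = False
n_layers = len(model.bert.encoder.layer)
for layer in model.bert.encoder.layer[: n_layers - k]:
    for param in layer.parameters():
        param.requires_grad = False

# Only the still-trainable parameters (last k layers, pooler, classifier head)
# are handed to the optimizer.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=2e-5
)
```

With k = 0 this degenerates to training only the task head; with k equal to the number of encoder layers it recovers full fine-tuning apart from the frozen embeddings.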

Cited by 20 publications (24 citation statements); references 12 publications.

Citation statements (ordered by relevance):
“…We observe that we gain similar speedup and max accuracy for all the runs for each dataset. We also include the test accuracy convergence curve with respect to time for each of the three repeated runs using stepped learning rate schedule for each dataset in Figures 19, 20, 21, 22, 23, and 24. We see that AutoFreeze and full fine-tuning achieve comparable max accuracy with an average end-to-end training speedup of 2.05×, 1.55×, 2.05×, 1.94×, 1.81×, and 1.56× for AG News, Sogou News, IMDb, Yelp F., SQuAD2.0 and SWAG respectively. We can also see that the freezing speedup is on the same scale across different runs.…”
Section: A1 Complete Results (mentioning)
confidence: 99%
“…One direct approach to reduce the computational cost is to only fine-tune a subset of the layers [22]. For example, as shown in Figure 2, only updating the last k layers of the BERT can lead to an almost linear decrease in time taken per iteration.…”
Section: Statically Freezing Model Layers (mentioning)
confidence: 99%
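
To illustrate the per-iteration timing effect mentioned in the passage above, a rough micro-benchmark like the following can compare step times for different numbers of trainable final layers. It is a sketch under stated assumptions (bert-base-uncased, random token ids, the batch and sequence sizes shown), not the cited paper's benchmark code, and absolute numbers will vary by hardware.

```python
# Rough micro-benchmark sketch: average training-step time when only the last
# k encoder layers are trainable. Model, batch size, and sequence length are
# illustrative assumptions.
import time
import torch
from transformers import AutoModelForSequenceClassification

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def seconds_per_iteration(k_trainable, steps=20, batch_size=16, seq_len=128):
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    ).to(DEVICE).train()
    # Freeze the embeddings and everything below the last k_trainable layers.
    for p in model.bert.embeddings.parameters():
        p.requires_grad = False
    n_layers = len(model.bert.encoder.layer)
    for layer in model.bert.encoder.layer[: n_layers - k_trainable]:
        for p in layer.parameters():
            p.requires_grad = False
    opt = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=2e-5
    )
    # Synthetic batch of random token ids and labels, just for timing.
    ids = torch.randint(0, model.config.vocab_size, (batch_size, seq_len), device=DEVICE)
    labels = torch.randint(0, 2, (batch_size,), device=DEVICE)
    if DEVICE == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(steps):
        loss = model(input_ids=ids, labels=labels).loss
        loss.backward()
        opt.step()
        opt.zero_grad()
    if DEVICE == "cuda":
        torch.cuda.synchronize()
    return (time.time() - start) / steps

for k in (12, 6, 2):
    print(f"last {k} layers trainable: {seconds_per_iteration(k):.3f} s/step")
```

Backpropagation can stop once it reaches the highest frozen layer (no tensor below it requires a gradient), which is why fewer trainable final layers shortens the backward pass roughly in proportion to the number of frozen layers.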
“…We observe that in transfer learning, freezing layers is mainly used for solving the overfitting problem [24]. While techniques such as static freezing [49] and cosine annealing [12] can reduce backward computation cost, accuracy loss is a common side effect. Thus, the main challenge of extending layer freezing to generic DNN training is how to maintain accuracy by only freezing the converged layers.…”
Section: Introduction (mentioning)
confidence: 99%
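
The passage above frames the challenge as freezing only layers that have already converged. The sketch below shows one crude version of that idea, using the per-layer gradient norm as a convergence signal; it is not AutoFreeze's actual algorithm, and the threshold and calling schedule are assumptions for illustration.

```python
import torch

def freeze_converged_layers(encoder_layers, threshold=1e-3):
    """Freeze any still-trainable layer whose current gradient norm is below threshold.

    Intended to be called periodically during training, right after a backward
    pass and before optimizer.zero_grad().
    """
    for layer in encoder_layers:
        grads = [p.grad for p in layer.parameters()
                 if p.requires_grad and p.grad is not None]
        if not grads:
            continue  # layer is already frozen or has no gradients yet
        grad_norm = torch.stack([g.detach().norm() for g in grads]).norm()
        if grad_norm < threshold:
            for p in layer.parameters():
                p.requires_grad = False

# Example call inside a fine-tuning loop (model as in the earlier sketch):
#   loss.backward()
#   freeze_converged_layers(model.bert.encoder.layer)
#   optimizer.step()
#   optimizer.zero_grad()  # set_to_none leaves frozen params with grad=None,
#                          # so the optimizer skips them on later steps
```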