2019
DOI: 10.48550/arxiv.1911.03090
Preprint

What Would Elsa Do? Freezing Layers During Transformer Fine-Tuning

Abstract: Pretrained transformer-based language models have achieved state-of-the-art results across countless tasks in natural language processing. These models are highly expressive, comprising at least a hundred million parameters and a dozen layers. Recent evidence suggests that only a few of the final layers need to be fine-tuned for high quality on downstream tasks. Naturally, a subsequent research question is, "how many of the last layers do we need to fine-tune?" In this paper, we precisely answer this question. We exam…
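
As a rough illustration of the setup the abstract describes (fine-tuning only the last few layers of a pretrained transformer), the sketch below freezes everything except the last k encoder layers of a Hugging Face BERT classifier. The model name, the value k = 2, and the optimizer settings are illustrative assumptions rather than the paper's exact configuration.

```python
# Minimal sketch: freeze all but the last k encoder layers before fine-tuning.
# "bert-base-uncased", k = 2, and the learning rate are illustrative
# assumptions, not the paper's reported setup.
import torch
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

k = 2  # number of final encoder layers left trainable

# Freeze the embeddings and every encoder layer below the last k.
for param in model.bert.embeddings.parameters():
    param.requires_grad = False
n_layers = len(model.bert.encoder.layer)
for layer in model.bert.encoder.layer[: n_layers - k]:
    for param in layer.parameters():
        param.requires_grad = False

# Only the still-trainable parameters (last k layers, pooler, classifier head)
# are handed to the optimizer.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=2e-5
)
```

With k = 0 this degenerates to training only the task head; with k equal to the number of encoder layers it recovers full fine-tuning apart from the frozen embeddings.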

Cited by 20 publications (24 citation statements); references 12 publications.

Citation statements (ordered by relevance):
“…We observe that we gain similar speedup and max accuracy for all the runs for each dataset. We also include the test accuracy convergence curve with respect to time for each of the three repeated runs using stepped learning rate schedule for each dataset in Figures 19, 20, 21, 22, 23, and 24. We see that AutoFreeze and full fine-tuning achieve comparable max accuracy with an average end-to-end training speedup of 2.05×, 1.55×, 2.05×, 1.94×, 1.81×, and 1.56× for AG News, Sogou News, IMDb, Yelp F., SQuAD2.0 and SWAG respectively. We can also see that the freezing speedup is on the same scale across different runs.…”
Section: A1 Complete Results (mentioning)
confidence: 99%
“…One direct approach to reduce the computational cost is to only fine-tune a subset of the layers [22]. For example, as shown in Figure 2, only updating the last k layers of the BERT can lead to an almost linear decrease in time taken per iteration.…”
Section: Statically Freezing Model Layers (mentioning)
confidence: 99%
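
To illustrate the per-iteration timing effect mentioned in the passage above, a rough micro-benchmark like the following can compare step times for different numbers of trainable final layers. It is a sketch under stated assumptions (bert-base-uncased, random token ids, the batch and sequence sizes shown), not the cited paper's benchmark code, and absolute numbers will vary by hardware.

```python
# Rough micro-benchmark sketch: average training-step time when only the last
# k encoder layers are trainable. Model, batch size, and sequence length are
# illustrative assumptions.
import time
import torch
from transformers import AutoModelForSequenceClassification

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

def seconds_per_iteration(k_trainable, steps=20, batch_size=16, seq_len=128):
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    ).to(DEVICE).train()
    # Freeze the embeddings and everything below the last k_trainable layers.
    for p in model.bert.embeddings.parameters():
        p.requires_grad = False
    n_layers = len(model.bert.encoder.layer)
    for layer in model.bert.encoder.layer[: n_layers - k_trainable]:
        for p in layer.parameters():
            p.requires_grad = False
    opt = torch.optim.AdamW(
        [p for p in model.parameters() if p.requires_grad], lr=2e-5
    )
    # Synthetic batch of random token ids and labels, just for timing.
    ids = torch.randint(0, model.config.vocab_size, (batch_size, seq_len), device=DEVICE)
    labels = torch.randint(0, 2, (batch_size,), device=DEVICE)
    if DEVICE == "cuda":
        torch.cuda.synchronize()
    start = time.time()
    for _ in range(steps):
        loss = model(input_ids=ids, labels=labels).loss
        loss.backward()
        opt.step()
        opt.zero_grad()
    if DEVICE == "cuda":
        torch.cuda.synchronize()
    return (time.time() - start) / steps

for k in (12, 6, 2):
    print(f"last {k} layers trainable: {seconds_per_iteration(k):.3f} s/step")
```

Backpropagation can stop once it reaches the highest frozen layer (no tensor below it requires a gradient), which is why fewer trainable final layers shortens the backward pass roughly in proportion to the number of frozen layers.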
“…We observe that in transfer learning, freezing layers is mainly used for solving the overfitting problem [24]. While techniques such as static freezing [49] and cosine annealing [12] can reduce backward computation cost, accuracy loss is a common side effect. Thus, the main challenge of extending layer freezing to generic DNN training is how to maintain accuracy by only freezing the converged layers.…”
Section: Introduction (mentioning)
confidence: 99%
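
The passage above frames the challenge as freezing only layers that have already converged. The sketch below shows one crude version of that idea, using the per-layer gradient norm as a convergence signal; it is not AutoFreeze's actual algorithm, and the threshold and calling schedule are assumptions for illustration.

```python
import torch

def freeze_converged_layers(encoder_layers, threshold=1e-3):
    """Freeze any still-trainable layer whose current gradient norm is below threshold.

    Intended to be called periodically during training, right after a backward
    pass and before optimizer.zero_grad().
    """
    for layer in encoder_layers:
        grads = [p.grad for p in layer.parameters()
                 if p.requires_grad and p.grad is not None]
        if not grads:
            continue  # layer is already frozen or has no gradients yet
        grad_norm = torch.stack([g.detach().norm() for g in grads]).norm()
        if grad_norm < threshold:
            for p in layer.parameters():
                p.requires_grad = False

# Example call inside a fine-tuning loop (model as in the earlier sketch):
#   loss.backward()
#   freeze_converged_layers(model.bert.encoder.layer)
#   optimizer.step()
#   optimizer.zero_grad()  # set_to_none leaves frozen params with grad=None,
#                          # so the optimizer skips them on later steps
```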