Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020
DOI: 10.18653/v1/2020.acl-main.197
SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization

Abstract: Transfer learning has fundamentally changed the landscape of natural language processing (NLP). Many state-of-the-art models are first pre-trained on a large text corpus and then fine-tuned on downstream tasks. However, due to limited data resources from downstream tasks and the extremely high complexity of pre-trained models, aggressive fine-tuning often causes the fine-tuned model to overfit the training data of downstream tasks and fail to generalize to unseen data. To address such an issue in a principled …
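The abstract is truncated above, so as a rough illustration of the "principled regularized optimization" named in the title, the sketch below shows a smoothness-inducing adversarial regularizer: it penalizes large changes in the model's output under small perturbations of the input embeddings, which discourages overfitting to the downstream training data. This is a minimal sketch only; it assumes a PyTorch classifier whose forward pass maps input embeddings directly to logits, and all hyperparameter values (eps, step_size, lambda_s) are illustrative rather than the paper's.

    import torch
    import torch.nn.functional as F

    def symmetric_kl(p_logits, q_logits):
        # Symmetrized KL divergence between two categorical distributions given as logits.
        p, q = F.log_softmax(p_logits, dim=-1), F.log_softmax(q_logits, dim=-1)
        return (F.kl_div(q, p.exp(), reduction="batchmean")
                + F.kl_div(p, q.exp(), reduction="batchmean"))

    def smoothness_penalty(model, embeds, clean_logits, eps=1e-5, step_size=1e-3):
        # One ascent step to find a small embedding perturbation that maximizes the
        # change in the model's output, then return the divergence at that perturbation.
        noise = (torch.randn_like(embeds) * eps).requires_grad_()
        adv_logits = model(embeds + noise)  # assumed interface: model(embeddings) -> logits
        grad, = torch.autograd.grad(symmetric_kl(adv_logits, clean_logits.detach()), noise)
        noise = (noise + step_size * grad / (grad.norm(dim=-1, keepdim=True) + 1e-12)).detach()
        return symmetric_kl(model(embeds + noise), clean_logits)

    # Illustrative fine-tuning objective:
    #   loss = task_loss + lambda_s * smoothness_penalty(model, embeds, clean_logits)

The paper's full method also concerns how the regularized objective is optimized (the "regularized optimization" part of the title), which this sketch does not attempt to cover.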

Cited by 248 publications (233 citation statements)
References 41 publications
“…Especially for the tasks with smaller training data (<10k), our method can achieve significant improvements (+1.7% on average) compared to the vanilla fine-tuning method. Because of the data scarcity, vanilla fine-tuning on these tasks is potentially brittle and prone to overfitting and catastrophic forgetting problems (Phang et al., 2018; Jiang et al., 2019). With the proposed RECADAM method, we successfully achieve better fine-tuning by learning target tasks while recalling the knowledge of pretraining tasks.…”
Section: Results on GLUE
confidence: 99%
“…Many approaches have been proposed to overcome the forgetting problem in various domains, such as machine translation (Miceli-Barone et al., 2017; Thompson et al., 2019) and reading comprehension. As sequential transfer learning is widely used for NLP tasks (Howard and Ruder, 2018; Devlin et al., 2019; Lan et al., 2020), previous works explore many fine-tuning tricks to reduce catastrophic forgetting when adapting deep pretrained LMs (Howard and Ruder, 2018; Sun et al., 2019; Zhang et al., 2019; Jiang et al., 2019; Lee et al., 2020). In this paper, we bring in the idea of multi-task learning, which can inherently avoid catastrophic forgetting, and achieve consistent improvements with the proposed RECADAM optimizer.…”
Section: Related Work
confidence: 99%
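The "recalling the knowledge of pretraining tasks" idea quoted above can be conveyed with a very small sketch: a quadratic penalty that pulls the fine-tuned weights back toward their pretrained values, approximating a multi-task objective over the target task and the pretraining task. This is not the actual RECADAM optimizer (whose update rule and annealing schedule are not reproduced here); the name recall_penalty and the weight gamma are illustrative.

    import torch

    def recall_penalty(model, pretrained_state, gamma=1.0):
        # Quadratic penalty keeping fine-tuned weights close to their pretrained
        # values; gamma is an illustrative weight, not a value from the paper.
        penalty = 0.0
        for name, param in model.named_parameters():
            if name in pretrained_state:
                penalty = penalty + ((param - pretrained_state[name].detach()) ** 2).sum()
        return gamma * penalty

    # pretrained_state would be a copy of model.state_dict() taken before fine-tuning:
    #   total_loss = task_loss + recall_penalty(model, pretrained_state)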
“…In future work, we plan to further investigate how different techniques apply to the problem of text segmentation, including data augmentation (Wei and Zou, 2019; Lukasik et al., 2020b) and methods for regularization and mitigating labeling noise (Jiang et al., 2020; Lukasik et al., 2020a).…”
Section: Discussion
confidence: 99%
“…The documents are tokenized using wordpieces and are chopped into spans no longer than 150 tokens on 20NG and 256 tokens on the other datasets. Hyper-parameters: For our method, we use λ_on = λ_off = 1, δ_on = 10⁻⁴, δ_off = 10⁻³, and δ_y = 0.1 for all the datasets. We then conduct an extensive hyper-parameter search for the baselines: for label smoothing, we search the smoothing parameter from {0.05, 0.1} as in (Müller et al., 2019); for ERL, the penalty weight is chosen from {0.05, 0.1, 0.25, 0.5, 1, 2.5, 5}; for VAT, we search the perturbation size in {10⁻³, 10⁻⁴, 10⁻⁵} as in (Jiang et al., 2020); for Mixup, we search the interpolation parameter from {0.1, 0.2, 0.3, 0.4} as suggested in (Zhang et al., 2018; Thulasidasan et al., 2019); for Manifold-mixup, we search from {0.2, 0.4, 1, 2, 4}. We perform 10 stochastic forward passes for MCDP at test time.…”
Section: B Experiments Details
confidence: 99%
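For reference, the fixed values and search grids quoted above can be written out as a small configuration and swept with a plain grid search. The dictionary keys below are shorthand names for the techniques listed in the quote, not identifiers from any of the cited codebases.

    from itertools import product

    # Fixed hyperparameters of the cited method, transcribed from the quote.
    method_hparams = {"lambda_on": 1, "lambda_off": 1,
                      "delta_on": 1e-4, "delta_off": 1e-3, "delta_y": 0.1}

    # Baseline search grids, transcribed from the quote.
    baseline_grids = {
        "label_smoothing": {"smoothing": [0.05, 0.1]},
        "erl":             {"penalty_weight": [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5]},
        "vat":             {"perturbation_size": [1e-3, 1e-4, 1e-5]},
        "mixup":           {"interpolation": [0.1, 0.2, 0.3, 0.4]},
        "manifold_mixup":  {"interpolation": [0.2, 0.4, 1, 2, 4]},
        "mcdp":            {"test_forward_passes": [10]},  # 10 stochastic passes at test time
    }

    def grid(search_space):
        # Yield every hyperparameter combination in a search space.
        keys, values = zip(*search_space.items())
        for combo in product(*values):
            yield dict(zip(keys, combo))

    # Example: enumerate all VAT settings.
    for cfg in grid(baseline_grids["vat"]):
        print(cfg)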