SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization

Jiang, Haoming; He, Pengcheng; Chen, Weizhu; Liu, Xiaodong; Gao, Jianfeng; Zhao, Tuo

doi:10.18653/v1/2020.acl-main.197

Cited by 248 publications

(233 citation statements)

References 41 publications

Supporting

Mentioning

230

Contrasting

Unclassified

Order By: Relevance

“…Especially for the tasks with smaller training data (<10k), our method can achieve significant improvements (+1.7% on average) compared to the vanilla fine-tuning method. Because of the data scarcity, vanilla fine-tuning on these tasks is potentially brittle and prone to overfitting and catastrophic forgetting problems (Phang et al, 2018;Jiang et al, 2019). With the proposed RECADAM method, we successfully achieve better fine-tuning by learning target tasks while recalling the knowledge of pretraining tasks.…”

Section: Results On Gluementioning

confidence: 99%

“…Many approaches have been proposed to overcome the forgetting problem in various domains, such as machine translation (Miceli-Barone et al, 2017;Thompson et al, 2019) and reading comprehension . As sequential transfer learning widely used for NLP tasks (Howard and Ruder, 2018;Devlin et al, 2019;Lan et al, 2020), previous works explore many finetuning tricks to reduce catastrophic forgetting for adaptation of the deep pretrained LMs (Howard and Ruder, 2018;Sun et al, 2019;Zhang et al, 2019;Jiang et al, 2019;Lee et al, 2020). In this paper, we bring the idea of multi-task learning which can inherently avoid catastrophic forgetting, and achieve consistent improvement with the proposed RECADAM optimizer.…”

Section: Related Workmentioning

confidence: 99%

See 1 more Smart Citation

Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting

Chen¹,

Hou²,

Cui³

et al. 2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

View full text Add to dashboard Cite

Deep pretrained language models have achieved great success in the way of pretraining first and then fine-tuning. But such a sequential transfer learning paradigm often confronts the catastrophic forgetting problem and leads to sub-optimal performance. To fine-tune with less forgetting, we propose a recall and learn mechanism, which adopts the idea of multi-task learning and jointly learns pretraining tasks and downstream tasks. Specifically, we introduce a Pretraining Simulation mechanism to recall the knowledge from pretraining tasks without data, and an Objective Shifting mechanism to focus the learning on downstream tasks gradually. Experiments show that our method achieves state-of-the-art performance on the GLUE benchmark. Our method also enables BERTbase to achieve better average performance than directly fine-tuning of BERT-large. Further, we provide the open-source RECADAM optimizer, which integrates the proposed mechanisms into Adam optimizer, to facility the NLP community. 1

show abstract

Section: Results On Gluementioning

confidence: 99%

Section: Related Workmentioning

confidence: 99%

Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting

Chen¹,

Hou²,

Cui³

et al. 2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

View full text Add to dashboard Cite

show abstract

“…In future work, we plan to further investigate how different techniques apply to the problem of text segmentation, including data augmentation (Wei and Zou, 2019;Lukasik et al, 2020b) and methods for regularization and mitigating labeling noise (Jiang et al, 2020;Lukasik et al, 2020a).…”

Section: Discussionmentioning

confidence: 99%

Text Segmentation by Cross Segment Attention

Łukasik¹,

Dadachev²,

Papineni³

et al. 2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

View full text Add to dashboard Cite

Document and discourse segmentation are two fundamental NLP tasks pertaining to breaking up text into constituents, which are commonly used to help downstream tasks such as information retrieval or text summarization. In this work, we propose three transformer-based architectures and provide comprehensive comparisons with previously proposed approaches on three standard datasets. We establish a new state-of-the-art, reducing in particular the error rates by a large margin in all cases. We further analyze model sizes and find that we can build models with many fewer parameters while keeping good performance, thus facilitating real-world applications. Early life and marriage:Franklin Delano Roosevelt was born on January 30, 1882, in the Hudson Valley town of Hyde Park, New York, to businessman James Roosevelt I and his second wife, Sara Ann Delano. (...) Aides began to refer to her at the time as "the president's girlfriend", and gossip linking the two romantically appeared in the newspapers.(...) Legacy: Roosevelt is widely considered to be one of the most important figures in the history of the United States, as well as one of the most influential figures of the 20th century. (...) Roosevelt has also appeared on several U.S. Postage stamps.

show abstract

“…The documents are tokenized using wordpieces and are chopped to spans no longer than 150 tokens on 20NG 15 and 20NG and 256 on other datasets.. Hyper-parameters: For our method, we use λ on = λ off = 1, δ on = 10 −4 , δ off = 10 −3 and δ y = 0.1 for all the datasets. We then conduct an extensive hyper-parameter search for the baselines: for label smoothing, we search the smoothing parameter from {0.05, 0.1} as in (Müller et al, 2019); for ERL, the penalty weight is chosen from {0.05, 0.1, 0.25, 0.5, 1, 2.5, 5}; for VAT, we search the perturbation size in {10 −3 , 10 −4 , 10 −5 } as in (Jiang et al, 2020); for Mixup, we search the interpolation parameter from {0.1, 0.2, 0.3, 0.4} as suggested in (Zhang et al, 2018;Thulasidasan et al, 2019); for Manifold-mixup, we search from {0.2, 0.4, 1, 2, 4}. We perform 10 stochastic forward passes for MCDP at test time.…”

Section: B Experiments Detailsmentioning

confidence: 99%

Calibrated Language Model Fine-Tuning for In- and Out-of-Distribution Data

Kong

Jiang

Zhuang³

et al. 2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Self Cite

View full text Add to dashboard Cite

Fine-tuned pre-trained language models can suffer from severe miscalibration for both in-distribution and out-of-distribution (OOD) data due to over-parameterization. To mitigate this issue, we propose a regularized fine-tuning method.Our method introduces two types of regularization for better calibration: (1) On-manifold regularization, which generates pseudo on-manifold samples through interpolation within the data manifold. Augmented training with these pseudo samples imposes a smoothness regularization to improve in-distribution calibration. (2) Off-manifold regularization, which encourages the model to output uniform distributions for pseudo off-manifold samples to address the over-confidence issue for OOD data. Our experiments demonstrate that the proposed method outperforms existing calibration methods for text classification in terms of expectation calibration error, misclassification detection, and OOD detection on six datasets. Our code can be found at https://github.com/Lingkai-Kong/ Calibrated-BERT-Fine-Tuning.

show abstract

SMART: Robust and Efficient Fine-Tuning for Pre-trained Natural Language Models through Principled Regularized Optimization

Cited by 248 publications

References 41 publications

Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting

Recall and Learn: Fine-tuning Deep Pretrained Language Models with Less Forgetting

Text Segmentation by Cross Segment Attention

Calibrated Language Model Fine-Tuning for In- and Out-of-Distribution Data

Contact Info

Product

Resources

About