Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1595

BAM! Born-Again Multi-Task Networks for Natural Language Understanding

Abstract: It can be challenging to train multi-task neural networks that outperform or even match their single-task counterparts. To help address this, we propose using knowledge distillation where single-task models teach a multi-task model. We enhance this training with teacher annealing, a novel method that gradually transitions the model from distillation to supervised learning, helping the multi-task model surpass its single-task teachers. We evaluate our approach by multi-task fine-tuning BERT on the GLUE benchmark.
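
The teacher annealing described in the abstract can be viewed as a per-example target that interpolates between the single-task teacher's predicted distribution and the gold label, with the weight on the gold label increasing as training progresses. The following is a minimal PyTorch sketch of that idea, not the authors' released code; the linear schedule and all names are illustrative assumptions.

import torch
import torch.nn.functional as F

def teacher_annealing_loss(student_logits, teacher_probs, labels, progress):
    """Cross-entropy against a target mixing gold labels and teacher predictions.

    `progress` is the fraction of training completed, in [0, 1]. Early on the
    target is mostly the teacher's distribution; by the end it is mostly the
    one-hot gold label, so the multi-task student can surpass its teachers.
    """
    lam = progress  # annealing weight; a linear schedule is assumed here
    one_hot = F.one_hot(labels, num_classes=student_logits.size(-1)).float()
    target = lam * one_hot + (1.0 - lam) * teacher_probs
    log_probs = F.log_softmax(student_logits, dim=-1)
    return -(target * log_probs).sum(dim=-1).mean()

# Toy usage: 4 examples, 3 classes, halfway through training.
logits = torch.randn(4, 3)
teacher = F.softmax(torch.randn(4, 3), dim=-1)
gold = torch.tensor([0, 2, 1, 0])
print(teacher_annealing_loss(logits, teacher, gold, progress=0.5).item())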

Cited by 188 publications (176 citation statements)
References: 34 publications

“…As a result, knowledge distillation has become another key feature in this new learning paradigm. An effective distillation step can often substantially compress a large model for efficient deployment (Clark et al., 2019; Tang et al., 2019; Liu et al., 2019a).…”
Section: Introduction (mentioning)
confidence: 99%
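
The compression role of distillation mentioned in this citation usually refers to the standard soft-target objective: a smaller student is trained to match a larger teacher's temperature-softened output distribution in addition to the gold labels. A minimal sketch follows; the temperature and mixing weight are illustrative defaults rather than values from any of the cited papers.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # Soft-target term: KL divergence between temperature-scaled distributions;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-target term: ordinary cross-entropy against the gold labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy usage: 4 examples, 3 classes.
student = torch.randn(4, 3)
teacher = torch.randn(4, 3)
gold = torch.tensor([1, 0, 2, 2])
print(distillation_loss(student, teacher, gold).item())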
“…These efforts have tried multiple teacher distillation methods in the field of computer vision, but little research has been done on the NLP deep pre-training based model. Concurrently with our work, several works also combine the multi-task learning with knowledge distillation [2, 18, 19]. However, they applied the knowledge distillation and multi-task learning to enhance the original model performance, instead of targeting model compression.…”
Section: Multi-task Learning (mentioning)
confidence: 86%
“…Therefore, it has captured a rough language model from large corpus. The distillation pre-trained model of stage 1 will be released soon.…”
Section: TMKD Architecture (mentioning)
confidence: 99%
“…Recent work on multi-task learning has focused on designing effective neural architectures (Hashimoto et al., 2017; Søgaard and Goldberg, 2016; Sanh et al., 2018; Ruder et al., 2017). Combining these two lines of work, (Clark et al., 2019) explored fine-tuning the contextualized models with multiple natural language understanding tasks. In this work, we depart from previous works by specifically studying the effects of multi-task fine-tuning for the stance prediction task with pre-trained models.…”
Section: Related Work (mentioning)
confidence: 99%