Findings of the Association for Computational Linguistics: EMNLP 2021 (2021)
DOI: 10.18653/v1/2021.findings-emnlp.111

Combining Curriculum Learning and Knowledge Distillation for Dialogue Generation

Abstract: Curriculum learning, a machine training strategy that feeds training instances to the model from easy to hard, has been proven to facilitate the dialogue generation task. Meanwhile, knowledge distillation, a knowledge transfer methodology between teacher and student networks, can yield a significant performance boost for student models. Hence, in this paper, we introduce a combination of curriculum learning and knowledge distillation for efficient dialogue generation models, where curriculum learning can he…
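
Although the abstract is cut off above, the recipe it describes can be illustrated with a rough sketch: order the training dialogues from easy to hard, reveal them gradually, and train a compact student against a frozen teacher's soft targets. Everything below (model interfaces, difficulty scores, the pacing schedule) is a hypothetical placeholder, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def curriculum_distillation(student, teacher, examples, difficulty, optimizer,
                            epochs=5, temperature=2.0, alpha=0.5):
    """Rough sketch: train `student` on an easy-to-hard schedule while
    distilling soft targets from a frozen `teacher`.

    `examples[i]` is an (inputs, labels) pair of token-id tensors and
    `difficulty[i]` is any scalar difficulty score -- both hypothetical.
    """
    order = sorted(range(len(examples)), key=lambda i: difficulty[i])  # easy -> hard
    teacher.eval()
    for epoch in range(epochs):
        # Curriculum pacing: each epoch reveals a longer prefix of the sorted data.
        visible = order[: max(1, int(len(order) * (epoch + 1) / epochs))]
        for i in visible:
            inputs, labels = examples[i]
            with torch.no_grad():
                t_logits = teacher(inputs)                 # teacher soft targets
            s_logits = student(inputs)
            ce = F.cross_entropy(s_logits, labels)         # fit the gold responses
            kd = F.kl_div(                                 # match the teacher's distribution
                F.log_softmax(s_logits / temperature, dim=-1),
                F.softmax(t_logits / temperature, dim=-1),
                reduction="batchmean",
            ) * temperature ** 2
            loss = alpha * ce + (1.0 - alpha) * kd
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```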

Cited by 11 publications (10 citation statements). References 33 publications.

“…Knowledge Distillation. Knowledge distillation (KD) has been actively studied for model compression in various fields [5,11,17,37,48,55]. KD transfers the knowledge captured by a large-capacity teacher model into a lightweight student model, significantly lowering the inference cost while maintaining comparable performance.…”
Section: Related Work (mentioning)
confidence: 99%
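
As a small, hedged illustration of the compression aspect described in this statement (the layer and width numbers are arbitrary examples, not taken from the cited works), a student built with fewer and narrower layers has an order of magnitude fewer parameters than its teacher, and hence a lower inference cost:

```python
import torch.nn as nn

def count_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

# Arbitrary example sizes: a 12-layer/768-dim "teacher" vs. a 4-layer/384-dim "student".
teacher = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=768, nhead=12, dim_feedforward=3072, batch_first=True),
    num_layers=12,
)
student = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=384, nhead=6, dim_feedforward=1536, batch_first=True),
    num_layers=4,
)
print(f"teacher parameters: {count_params(teacher):,}")  # roughly 85M with these settings
print(f"student parameters: {count_params(student):,}")  # roughly 7M, hence cheaper inference
```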
“…On the other hand, self-paced learning [21] makes the curriculum dynamically adjusted during training, usually based on the training loss [21] or performance on the validation set [6,49]. Easy-to-hard learning has been applied to KD to improve distillation efficacy in computer vision [14,39] and natural language processing [52,55]. [3,14,39] exploit the teacher's optimization route to form a curriculum for the student, while [52] gradually includes an increasing number of fine-grained document pairs during training.…”
Section: Related Work (mentioning)
confidence: 99%
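
The self-paced variant mentioned here derives the curriculum from the model's own training losses rather than from a fixed ordering. A minimal sketch of the classic thresholding rule (keep a sample this round only if its current loss is below a threshold that grows over time) could look as follows; the threshold schedule and the per-sample loss helper are illustrative assumptions.

```python
import torch

def self_paced_weights(losses: torch.Tensor, lam: float) -> torch.Tensor:
    """Hard self-paced weighting: 1.0 for samples whose current loss is
    below the threshold `lam`, 0.0 for the rest (excluded this round)."""
    return (losses < lam).float()

# Usage sketch: raise the threshold each epoch so harder samples enter later.
# losses  = per_sample_loss(student, batch)               # hypothetical helper
# weights = self_paced_weights(losses, lam=0.5 + 0.5 * epoch)
# loss    = (weights * losses).sum() / weights.sum().clamp(min=1.0)
```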
“…Classical training algorithms sample instances from the corpus according to a static uniform distribution. Curriculum learning instead adopts a dynamic data sampling strategy during training (Zhu et al., 2021; Guo et al., 2020; Qian et al., 2020). For example, it imitates taking well-designed easy-to-hard training courses, where "easy" instances are more likely to be sampled at the early training stage and "hard" instances have higher sampling probabilities at the late training stage.…”
Section: Mixed Sequence Distillation (mentioning)
confidence: 99%
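
The contrast drawn here, static uniform sampling versus a distribution that drifts from easy toward hard as training progresses, can be made concrete with a small sketch; the softmax-over-difficulty schedule and the toy difficulty scores are illustrative assumptions rather than any cited paper's exact scheme.

```python
import numpy as np

def curriculum_sampling_probs(difficulty: np.ndarray, progress: float) -> np.ndarray:
    """Sampling distribution over the corpus at training progress in [0, 1].
    Early on (progress near 0) easier instances carry more probability mass;
    late in training (progress near 1) harder instances are favoured instead."""
    preference = (2.0 * progress - 1.0) * difficulty   # negative: favour easy; positive: favour hard
    exp = np.exp(preference - preference.max())        # numerically stable softmax
    return exp / exp.sum()

# Toy corpus of five dialogues with made-up difficulty scores.
difficulty = np.array([0.1, 0.3, 0.5, 0.7, 0.9])
print(curriculum_sampling_probs(difficulty, progress=0.0))  # skewed toward easy instances
print(curriculum_sampling_probs(difficulty, progress=1.0))  # skewed toward hard instances
```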
“…When many samples are difficult, such as in loud-noise conditions, a joint framework of robust training and compression is a popular approach in various other domains. Existing studies in computer vision and natural language processing have proposed joint frameworks combining curriculum learning and knowledge distillation methods [28], [29]. However, studies on joint optimization methods for noise robustness are scarce in the field of acoustic modeling.…”
Section: Introduction (mentioning)
confidence: 99%