Proceedings of the Fourth Workshop on Neural Generation and Translation 2020
DOI: 10.18653/v1/2020.ngt-1.12

Distill, Adapt, Distill: Training Small, In-Domain Models for Neural Machine Translation

Abstract: We explore best practices for training small, memory efficient machine translation models with sequence-level knowledge distillation in the domain adaptation setting. While both domain adaptation and knowledge distillation are widely-used, their interaction remains little understood. Our large-scale empirical results in machine translation (on three language pairs with three domains each) suggest distilling twice for best performance: once using general-domain data and again using in-domain data with an adapted…
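Read literally, the title and (truncated) abstract describe two rounds of sequence-level knowledge distillation separated by a domain-adaptation step. The Python sketch below only illustrates that ordering and is not the authors' implementation: the train, tune, and decode callables are hypothetical stand-ins for whatever NMT toolkit is used, and the choice to continue training the general-domain student in the second round is an assumption, since the abstract is cut off.

from typing import Callable, List, Tuple

Corpus = List[Tuple[str, str]]                        # (source, reference) sentence pairs
TrainFn = Callable[[Corpus, int], object]             # corpus, num_layers -> model
TuneFn = Callable[[object, Corpus], object]           # model, corpus -> fine-tuned model
DecodeFn = Callable[[object, List[str]], List[str]]   # model, sources -> translations


def _distill_corpus(teacher: object, corpus: Corpus, decode: DecodeFn) -> Corpus:
    """Replace the references with the teacher's translations (sequence-level KD)."""
    sources = [src for src, _ in corpus]
    return list(zip(sources, decode(teacher, sources)))


def distill_adapt_distill(general: Corpus, in_domain: Corpus,
                          train: TrainFn, tune: TuneFn, decode: DecodeFn,
                          small: int = 6, large: int = 12) -> object:
    """Two rounds of sequence-level KD around a domain-adaptation step."""
    # 1) Train a large general-domain teacher, then distill it into a small
    #    general-domain student by training on the teacher's translations.
    teacher = train(general, large)
    student = train(_distill_corpus(teacher, general, decode), small)
    # 2) Adapt the teacher to the target domain, then distill again by
    #    continuing training of the student on in-domain teacher translations.
    adapted_teacher = tune(teacher, in_domain)
    return tune(student, _distill_corpus(adapted_teacher, in_domain, decode))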

Cited by 7 publications (7 citation statements)
References 15 publications

“…It is worth noting that this method of knowledge distillation is difficult to apply when the domain of the training data is not well defined. Both Currey et al [9] and Gordon and Duh [10] use similar architectures for their models, that is, teacher models with 12 encoder and decoder layers and student models with six encoder and decoder layers. Training teacher models with this type of architecture requires a large amount of memory and GPUs.…”
Section: Introduction
Mentioning (confidence: 99%)
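For concreteness, the layer-count asymmetry this statement describes (12-layer teachers versus 6-layer students) can be written down with PyTorch's generic nn.Transformer; the cited papers use full NMT toolkits, so the remaining hyperparameters here (d_model, nhead) are placeholders rather than values taken from those papers.

import torch.nn as nn

# Large teacher vs. small student: 12 vs. 6 encoder and decoder layers.
teacher = nn.Transformer(d_model=512, nhead=8,
                         num_encoder_layers=12, num_decoder_layers=12)
student = nn.Transformer(d_model=512, nhead=8,
                         num_encoder_layers=6, num_decoder_layers=6)

def n_params(model: nn.Module) -> int:
    return sum(p.numel() for p in model.parameters())

print(f"teacher: {n_params(teacher):,} parameters")
print(f"student: {n_params(student):,} parameters")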
“…A variation of forward translation uses one or more much larger or otherwise stronger 'teacher' models to generate in-domain forward translations which are then used to train or tune a 'student' model. Gordon and Duh (2020) demonstrate that even strong generic models can benefit from training on in-domain forward translations from a teacher. Currey et al (2020) extend this idea to train a multi-domain system using forward translations from multiple separate in-domain teachers.…”
Section: Forward Translation
Mentioning (confidence: 91%)
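The multi-teacher variant attributed to Currey et al. (2020) can be sketched as pooling forward translations from several in-domain teachers into one training set for a single multi-domain student. The function below is a hypothetical illustration only; each decode callable stands in for whatever beam-search routine the actual toolkit provides.

from typing import Callable, Dict, List, Tuple

DecodeFn = Callable[[List[str]], List[str]]  # sources -> teacher translations

def pool_forward_translations(
    teachers: Dict[str, DecodeFn],      # one adapted teacher per domain
    sources: Dict[str, List[str]],      # in-domain source sentences per domain
) -> List[Tuple[str, str]]:
    """Pool (source, teacher translation) pairs across domains for one student."""
    pooled: List[Tuple[str, str]] = []
    for domain, decode in teachers.items():
        pooled.extend(zip(sources[domain], decode(sources[domain])))
    return pooled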
“…All these works focus on the distillation of general-purpose language models. Gordon and Duh (2020) investigated the interaction between KD and Domain Adaptation.…”
Section: Related Work
Mentioning (confidence: 99%)