Proceedings of the 3rd Workshop on Neural Generation and Translation 2019
DOI: 10.18653/v1/d19-5632
From Research to Production and Back: Ludicrously Fast Neural Machine Translation

Abstract: This paper describes the submissions of the "Marian" team to the WNGT 2019 efficiency shared task. Taking our dominating submissions to the previous edition of the shared task as a starting point, we develop improved teacher-student training via multi-agent dual-learning and noisy backward-forward translation for Transformer-based student models. For efficient CPU-based decoding, we propose pre-packed 8-bit matrix products, improved batched decoding, cache-friendly student architectures with parameter sharing a…
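As a rough illustration of the "pre-packed 8-bit matrix products" mentioned in the abstract, the sketch below shows the basic idea: the weight matrix is quantized once, ahead of time, and reused for every decoding step, while activations are quantized on the fly before an integer matrix product. This is a minimal NumPy toy under assumed shapes and a simple symmetric quantization scheme, not the paper's optimized CPU kernels.

```python
import numpy as np

# Minimal sketch (not Marian's implementation) of 8-bit matrix products:
# quantize the weights offline ("pre-pack" once), quantize activations at
# decode time, multiply in integer arithmetic, then dequantize the result.

def quantize_int8(x):
    """Symmetric per-tensor quantization to int8; returns values and scale."""
    max_abs = np.abs(x).max()
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 512)).astype(np.float32)   # weights (offline)
x = rng.standard_normal((4, 256)).astype(np.float32)     # activations (runtime)

W_q, w_scale = quantize_int8(W)   # done once before decoding starts
x_q, x_scale = quantize_int8(x)   # done per batch at decode time

# Integer GEMM with a wider accumulator, then dequantize.
y_int32 = x_q.astype(np.int32) @ W_q.astype(np.int32)
y = y_int32.astype(np.float32) * (w_scale * x_scale)

# The quantized product closely approximates the float32 reference.
print(np.abs(y - x @ W).max())
```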

Cited by 54 publications (73 citation statements)
References 11 publications
“…Team Marian's submission (Kim et al., 2019) was based on their submission to the shared task the previous year, consisting of Transformer models optimized in a number of ways (Junczys-Dowmunt et al., 2018), with a number of improvements. Improvements were made to teacher-student training by (1) creating more data for teacher-student training using backward, then forward translation, and (2) using multiple teachers to generate better distilled data for training student models.…”
Section: Team Marian
confidence: 99%
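To make the multi-teacher idea in the quote above concrete, here is a small sketch of one way several teachers' next-token distributions could be combined into a single distillation target. The combination rule (geometric mean in log space), the vocabulary size, and all shapes are illustrative assumptions, not the exact scheme used by Team Marian.

```python
import numpy as np

# Toy combination of several teachers' next-token distributions into one
# distillation target. Everything here is an assumption for illustration.

vocab_size, num_teachers = 8, 3
rng = np.random.default_rng(0)

# Hypothetical per-teacher probabilities over the vocabulary for one step.
teacher_probs = rng.dirichlet(np.ones(vocab_size), size=num_teachers)

# Average in log space (geometric mean), then renormalise.
log_mean = np.mean(np.log(teacher_probs), axis=0)
combined = np.exp(log_mean - log_mean.max())
combined /= combined.sum()

# A student can be trained on these soft targets, or on full sentences
# decoded with the combined teachers (sequence-level distillation).
print(combined, int(np.argmax(combined)))
```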
“…This problem setup is closely related to model distillation (Hinton et al., 2014): training a student model to imitate the predictions of a teacher. Distillation has widespread use in MT, including reducing architecture size (Kim and Rush, 2016; Kim et al., 2019), creating multilingual models (Tan et al., 2019), and improving non-autoregressive generation (Ghazvininejad et al., 2019; Stern et al., 2019). Model stealing differs from distillation because the victim's (i.e., teacher's) training data is unknown.…”
Section: Past Work On Distillation and Stealing
confidence: 99%
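The excerpt above describes distillation as training a student to imitate a teacher's predictions. A minimal word-level objective in that spirit is sketched below in PyTorch; the temperature, mixing weight, and random tensors are illustrative assumptions, and sequence-level distillation (Kim and Rush, 2016) would instead train the student on full teacher translations.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, gold_ids,
                      temperature=2.0, alpha=0.5):
    # Soft targets: the teacher's distribution, softened by the temperature.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(log_student, soft_targets,
                  reduction="batchmean") * temperature ** 2
    # Standard cross-entropy against the gold reference tokens.
    ce = F.cross_entropy(student_logits, gold_ids)
    return alpha * kd + (1.0 - alpha) * ce

# Toy usage with random logits for a batch of 4 positions and a 100-word vocab.
batch, vocab = 4, 100
loss = distillation_loss(torch.randn(batch, vocab),
                         torch.randn(batch, vocab),
                         torch.randint(0, vocab, (batch,)))
print(loss.item())
```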
“…The goal of the distillation step is to reduce the complexity of the original training data D^tr. Instead of achieving this with a single deep teacher as in Kim and Rush (2016) and Kim et al. (2019), we use the multiple domain-specific teachers trained in step 1. Each training set D_i^tr is translated with its corresponding deep teacher, resulting in a distilled version D_i^dist(tr) of that set.…”
Section: In-domain Distillation
confidence: 99%
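The per-domain distillation loop described in this excerpt could be sketched as below: each in-domain training set is re-translated by its own deep teacher to build the distilled sets. The `distill_per_domain` helper and the callable "teachers" are hypothetical stand-ins for illustration, not an interface from the cited work.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical helper: translate each in-domain training set D_i^tr with
# its own deep teacher to obtain the distilled set D_i^dist(tr).

def distill_per_domain(
    train_sets: Dict[str, List[str]],
    teachers: Dict[str, Callable[[List[str]], List[str]]],
) -> Dict[str, List[Tuple[str, str]]]:
    distilled = {}
    for domain, sources in train_sets.items():
        # The matching domain-specific teacher decodes the source side;
        # the resulting pairs form the distilled training data.
        targets = teachers[domain](sources)
        distilled[domain] = list(zip(sources, targets))
    return distilled

# Toy usage with dummy "teachers" that only uppercase the input.
demo = distill_per_domain(
    {"news": ["ein Satz"], "medical": ["ein Befund"]},
    {"news": lambda xs: [s.upper() for s in xs],
     "medical": lambda xs: [s.upper() for s in xs]},
)
print(demo)
```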