Proceedings of ACL 2018, System Demonstrations 2018
DOI: 10.18653/v1/p18-4022

RETURNN as a Generic Flexible Neural Toolkit with Application to Translation and Speech Recognition

Abstract: We compare the fast training and decoding speed of RETURNN of attention models for translation, due to fast CUDA LSTM kernels, and a fast pure TensorFlow beam search decoder. We show that a layer-wise pretraining scheme for recurrent attention models gives over 1% BLEU improvement absolute and it allows to train deeper recurrent encoder networks. Promising preliminary results on max. expected BLEU training are presented. We obtain state-of-the-art models trained on the WMT 2017 German↔English translation task…

Citations: cited by 63 publications (50 citation statements)
References: 31 publications
“…An MP layer has no trainable parameters, but helps primarily in reducing the length of an input sequence by a predetermined pooling factor. It has been observed that reducing the input sequence length progressively through the encoder helps in better convergence and accuracy of the AED models [2,8,16]. In this work, we use an overall time reduction factor of 2 for character encoder, and an additional reduction factor of 4 in the BPE stack, taking the overall reduction factor for the BPE encoder to 8.…”
Section: Attention Based Encoder-decoder Models (mentioning)
confidence: 99%
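
The length arithmetic in the statement above (a factor of 2, then a further factor of 4, for an overall factor of 8) can be illustrated with a minimal sketch. It assumes non-overlapping max pooling along the time axis with window equal to stride; the helper name `max_pool_time` and the toy input are hypothetical and are not taken from the citing paper or from RETURNN.

```python
from typing import List


def max_pool_time(frames: List[List[float]], factor: int) -> List[List[float]]:
    """Max-pool a [time][feature] sequence along time (window = stride = factor).

    Hypothetical helper for illustration only; it has no trainable parameters.
    A trailing partial window is kept, so the output length is ceil(T / factor).
    """
    if factor < 1:
        raise ValueError("factor must be >= 1")
    pooled = []
    for start in range(0, len(frames), factor):
        window = frames[start:start + factor]
        # Element-wise maximum over the frames inside this window.
        pooled.append([max(column) for column in zip(*window)])
    return pooled


# Toy input (assumed): 16 frames with 2 features each.
frames = [[float(t), float(-t)] for t in range(16)]

char_level = max_pool_time(frames, 2)     # factor 2: 16 frames -> 8
bpe_level = max_pool_time(char_level, 4)  # further factor 4: 8 frames -> 2

print(len(frames), len(char_level), len(bpe_level))  # prints: 16 8 2
```

Stacking the two pooling steps multiplies their factors, which is why the second sequence is 8 times shorter than the input.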
“…In [9], a teacher-student transfer learning has been used to improve the convergence as well as the performance of character based CTC encoder models. In the case of AED models, it has been shown that a carefully designed layerwise pre-training strategy helps the models to converge better [2]. Similarly, a joint multi-task training using a CTC loss on the encoder and a CE loss on the attention decoder was proposed to improve the convergence and accuracies of AED models [19,20].…”
Section: Multi-stage Multi-task Training Of Online Attention Models (mentioning)
confidence: 99%