Non-Autoregressive Neural Machine Translation with Enhanced Decoder Input

Guo, Junliang; Tan, Xu; He, Di; Qin, Tao; Xu, Linli; Liu, Tie-Yan

doi:10.1609/aaai.v33i01.33013723

Cited by 104 publications

(99 citation statements)

References 10 publications

“…[20,21] combined joint n-gram models with Bi-LSTM models and achieved good performance in G2P conversion. [5] adopted convolutional sequence to sequence model and proposed the non-sequential decoding [22] for G2P conversion, which achieved the previous state-of-the-art result on the public CMUDict 0.7b dataset.…”

Section: Grapheme-to-phoneme Conversion (mentioning)

confidence: 74%

Token-Level Ensemble Distillation for Grapheme-to-Phoneme Conversion

et al. 2019

Self Cite

Grapheme-to-phoneme (G2P) conversion is an important task in automatic speech recognition and text-to-speech systems. Recently, G2P conversion has been viewed as a sequence-to-sequence task and modeled with RNN- or CNN-based encoder-decoder frameworks. However, previous works do not consider the practical issues that arise when deploying a G2P model in a production system, such as how to leverage additional unlabeled data to boost accuracy and how to reduce model size for online deployment. In this work, we propose token-level ensemble distillation for G2P conversion, which can (1) boost accuracy by distilling knowledge from additional unlabeled data, and (2) reduce the model size while maintaining high accuracy, both of which are practical and helpful in an online production system. We use token-level knowledge distillation, which results in better accuracy than the sequence-level counterpart. In addition, we adopt the Transformer instead of RNN- or CNN-based models to further boost the accuracy of G2P conversion. Experiments on the publicly available CMUDict dataset and an internal English dataset demonstrate the effectiveness of our proposed method. In particular, our method achieves 19.88% WER on the CMUDict dataset, outperforming previous works by more than 4.22% WER and setting a new state of the art.

“…There are many design choices in the encoder-decoder framework based on different types of layers, such as RNN-based (Sutskever et al, 2014), CNN-based (Gehring et al, 2017), and self-attention based (Vaswani et al, 2017). In terms of speeding up the decoding of the neural Transformer, Gu et al (2017) modified the autoregressive architecture to directly generate target words in parallel. In the past two years, non-autoregressive and semi-autoregressive models have been extensively studied (Oord et al, 2017; Kaiser et al, 2018; Lee et al, 2018; Libovický and Helcl, 2018; Wang et al, 2019; Guo et al, 2018; Zhou et al, 2019a). Previous work shows that NAT can be improved via knowledge distillation from AT models.…”

Section: Related Work (mentioning)

confidence: 99%

Improving Autoregressive NMT with Non-Autoregressive Model

Zhang, Zhang, Zong 2020

Proceedings of the First Workshop on Automatic Simultaneous Translation

Autoregressive neural machine translation (NMT) models are often used to teach non-autoregressive models via knowledge distillation. However, there are few studies on improving the quality of autoregressive translation (AT) using non-autoregressive translation (NAT). In this work, we propose a novel Encoder-NAD-AD framework for NMT, aiming at boosting AT with global information produced by the NAT model. Specifically, under the semantic guidance of source-side context captured by the encoder, the non-autoregressive decoder (NAD) first learns to generate the target-side hidden state sequence in parallel. Then the autoregressive decoder (AD) performs translation from left to right, conditioned on source-side and target-side hidden states. Since the AD has global information generated by the low-latency NAD, it is more likely to produce a better translation with less time delay. Experiments on WMT14 En⇒De, WMT16 En⇒Ro, and IWSLT14 De⇒En translation tasks demonstrate that our framework achieves significant improvements with only 8% speed degeneration over the autoregressive NMT.

“…Due to the multimodality problem [13], the performance of NAR model is usually inferior to AR model. Recently, a line of works aiming to bridge the performance gap between NAR and AR model for translation task has been presented [11,14].…”

Section: Non-autoregressive Decoding (mentioning)

confidence: 99%

FastLR

Liu, Ren, Zhao et al. 2020

Proceedings of the 28th ACM International Conference on Multimedia

Lipreading is an impressive technique, and its accuracy has improved markedly in recent years. However, existing lipreading methods mainly build on autoregressive (AR) models, which generate target tokens one by one and suffer from high inference latency. To break through this constraint, we propose FastLR, a non-autoregressive (NAR) lipreading model which generates all target tokens simultaneously. NAR lipreading is a challenging task with several difficulties: 1) the discrepancy in sequence length between source and target makes it difficult to estimate the length of the output sequence; 2) the conditionally independent behavior of NAR generation lacks correlation across time, which leads to a poor approximation of the target distribution; 3) the feature representation ability of the encoder can be weak due to the lack of an effective alignment mechanism; and 4) the removal of the AR language model exacerbates the inherent ambiguity problem of lipreading. Thus, in this paper, we introduce three methods to reduce the gap between FastLR and the AR model: 1) to address challenges 1 and 2, we leverage an integrate-and-fire (I&F) module to model the correspondence between source video frames and the output text sequence. 2) To tackle challenge 3, we add an auxiliary connectionist temporal classification (CTC) decoder on top of the encoder and optimize it with an extra CTC loss. We also add an auxiliary autoregressive decoder to help the feature extraction of the encoder. 3) To overcome challenge 4, we propose a novel Noisy Parallel Decoding (NPD) for I&F and bring Byte-Pair Encoding (BPE) into lipreading. Our experiments show that FastLR achieves a speedup of up to 10.97× compared with the state-of-the-art lipreading model, with a slight absolute WER increase of 1.5% and 5.5% on the GRID and LRS2 lipreading datasets respectively, which demonstrates the effectiveness of our proposed method.
