Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1126

Monotonic Infinite Lookback Attention for Simultaneous Machine Translation

Abstract: Simultaneous machine translation begins to translate each source sentence before the source speaker is finished speaking, with applications to live and streaming scenarios. Simultaneous systems must carefully schedule their reading of the source sentence to balance quality against latency. We present the first simultaneous translation system to learn an adaptive schedule jointly with a neural machine translation (NMT) model that attends over all source tokens read thus far. We do so by introducing Monotonic In…
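The abstract's core mechanism pairs a hard, monotonic attention head that schedules reading with a soft attention head that spans every source token read so far. A minimal inference-time sketch of one such decoder step follows; p_choose, energy_fn, and the greedy 0.5 stopping threshold are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def milk_decode_step(p_choose, encoder_states, prev_head, query, energy_fn):
    """One MILk-style decoder step at inference time (greedy sketch).

    p_choose[j]    : probability of stopping the monotonic head on source token j
    encoder_states : (num_read, dim) states for the source tokens read so far
    prev_head      : index where the monotonic head stopped at the previous step
    query          : current decoder state used for soft attention
    energy_fn      : soft-attention energy function (placeholder)
    """
    # Advance the hard monotonic head: keep reading source tokens until the
    # stopping probability exceeds 0.5 or the available input is exhausted.
    head = prev_head
    while head < len(p_choose) - 1 and p_choose[head] < 0.5:
        head += 1

    # Soft attention looks back over *all* tokens up to the head (the
    # "infinite lookback"), rather than over a fixed-size chunk.
    energies = np.array([energy_fn(query, encoder_states[j]) for j in range(head + 1)])
    weights = np.exp(energies - energies.max())
    weights /= weights.sum()
    context = weights @ encoder_states[: head + 1]
    return context, head
```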

Cited by 147 publications (234 citation statements). References 16 publications.
“…For the MoChA model we use a chunk size of 8. In the MILK model we applied a latency loss different from the one proposed in [17], as the original latency loss is tailored for machine translation, where the source and target sequences have similar lengths. Our latency loss minimizes the root-mean-square value of the interval between two consecutive emissions. Table 1: Word-error-rates of end-to-end models on YouTube test set.…”
Section: Methods
confidence: 99%
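The formula this statement introduces was lost in extraction; a plausible formalization of a root-mean-square emission-interval penalty, assuming g(i) denotes the amount of source (e.g., encoder frames) consumed before emitting the i-th output token, might look like the following. This is an illustrative reconstruction, not the cited paper's actual equation.

```latex
% Hypothetical RMS emission-interval latency loss (illustrative only).
% g(i) = amount of source consumed before emitting the i-th output token.
\mathcal{L}_{\text{latency}} =
\sqrt{\frac{1}{|\mathbf{y}|-1}\sum_{i=2}^{|\mathbf{y}|}\bigl(g(i)-g(i-1)\bigr)^{2}}
```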
“…This fixed window size may still limit the full potential of the attention mechanism. The monotonic infinite lookback attention (MILK) mechanism was proposed in [17] to allow the attention window to look back all the way to the beginning of the sequence.…”
Section: Monotonic Infinite Lookback Attention
confidence: 99%
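The contrast the statement draws is in which encoder states the soft attention may cover once the monotonic head stops at position t: MoChA restricts it to a fixed chunk of size w (8 in the quote above), while MILK lets it reach back to the start. A schematic slice, purely for illustration:

```python
def attention_window(encoder_states, t, w=8, infinite_lookback=False):
    # Which encoder states the soft attention may cover once the monotonic
    # head has stopped at index t (0-based). Schematic only.
    if infinite_lookback:
        return encoder_states[: t + 1]                    # MILK: back to the start
    return encoder_states[max(0, t - w + 1): t + 1]       # MoChA: fixed chunk of size w
```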
“…With this refined g, we can make several latency metrics content-aware, including average proportion (Cho and Esipova, 2016), consecutive wait (Gu et al., 2017), average lagging, and differentiable average lagging (Arivazhagan et al., 2019b). We opt for differentiable average lagging (DAL) because of its interpretability and because it sidesteps some problems with average lagging (Cherry and Foster, 2019).…”
Section: Latency
confidence: 99%
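Differentiable average lagging (DAL) charges a minimum cost per output so that delays accumulate rather than being forgiven by later, faster emissions. Below is a sketch under the commonly used definition, with the "refined g" from the quote supplied as a list; treat it as an illustration rather than the cited works' exact code.

```python
def differentiable_average_lagging(g, src_len, tgt_len):
    """Sketch of differentiable average lagging (DAL).

    g[i] : (possibly fractional) amount of source consumed before emitting
           target token i + 1, e.g., the content-aware, refined g above.
    """
    gamma = tgt_len / src_len          # target-to-source length ratio
    g_prime = []
    for i, gi in enumerate(g):
        if i == 0:
            g_prime.append(gi)
        else:
            # Each output charges at least 1/gamma, so lag cannot be "repaid".
            g_prime.append(max(gi, g_prime[-1] + 1.0 / gamma))
    # Average lag relative to an ideally paced translator.
    return sum(gp - i / gamma for i, gp in enumerate(g_prime)) / tgt_len
```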
“…We employ their wait-k training as a baseline, and use their wait-k inference to improve re-translation. Our second and strongest streaming baseline is the MILk approach of Arivazhagan et al. (2019b), who improve upon wait-k training with a hierarchical attention that can adapt how long it waits based on the current context. Both wait-k training and MILk attention provide hyper-parameters to control their quality-latency trade-offs: k for wait-k, and the latency weight for MILk.…”
Section: Introduction
confidence: 99%
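Wait-k's fixed schedule is simple to state: read k source tokens, then alternate one write with one read until the source runs out, after which the model writes freely. A sketch of that schedule, where translate_step is a placeholder for whichever prefix-to-prefix model produces the next target token (illustrative, not either paper's implementation):

```python
def wait_k_translate(k, source_tokens, translate_step):
    """Fixed wait-k schedule: stay k source tokens ahead of the writes.

    translate_step(src_prefix, tgt_prefix) -> next target token (placeholder).
    Assumes translate_step eventually emits "</s>".
    """
    src_prefix, tgt_prefix = [], []
    pos = 0
    while True:
        # READ until k tokens ahead of the number of writes (or source ends).
        while pos < len(source_tokens) and pos < len(tgt_prefix) + k:
            src_prefix.append(source_tokens[pos])
            pos += 1
        token = translate_step(src_prefix, tgt_prefix)   # WRITE one target token
        if token == "</s>":
            break
        tgt_prefix.append(token)
    return tgt_prefix
```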
“…When the source sentence ends, the decoder can do a tail beam search on the remaining target words, but beam search is seemingly impossible before the source sentence ends. [Table: taxonomy of approaches by policy and model type. Fixed-latency policies: test-time wait-k (Dalvi et al., 2018) on sequence-to-sequence (full-sentence) models, wait-k on prefix-to-prefix (simultaneous) models. Adaptive policies: RL (Gu et al., 2017), MILk (Arivazhagan et al., 2019), supervised/imitation learning.] 2. The second method learns an adaptive policy which uses either supervised or reinforcement learning (Grissom II et al., 2014; Gu et al., 2017) to decide whether to READ (the next source word) or WRITE (the next target word).…”
Section: Simultaneous MT: Policies and Models
confidence: 99%
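The READ/WRITE framing above reduces to one decoding loop in which a policy, whether fixed like wait-k or learned like MILk or an RL agent, picks the next action from the current prefixes. A hedged sketch with both the policy and the translation model as placeholder callables:

```python
READ, WRITE = "READ", "WRITE"

def simultaneous_decode(policy, model, source_tokens):
    """Generic simultaneous decoding loop driven by a READ/WRITE policy.

    policy(src_prefix, tgt_prefix) -> READ or WRITE  (placeholder callable)
    model(src_prefix, tgt_prefix)  -> next target token (placeholder callable)
    """
    src_prefix, tgt_prefix = [], []
    while True:
        action = policy(src_prefix, tgt_prefix)
        if action == READ and len(src_prefix) < len(source_tokens):
            # READ: consume the next source token.
            src_prefix.append(source_tokens[len(src_prefix)])
        else:
            # WRITE (forced once the source is exhausted).
            token = model(src_prefix, tgt_prefix)
            if token == "</s>":
                break
            tgt_prefix.append(token)
    return tgt_prefix
```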