Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1290

Look Harder: A Neural Machine Translation Model with Hard Attention

Abstract: Soft-attention based Neural Machine Translation (NMT) models have achieved promising results on several translation tasks. These models attend to all the words in the source sequence for each target token, which makes them ineffective for long-sequence translation. In this work, we propose a hard-attention based NMT model which selects a subset of source tokens for each target token to handle long-sequence translation effectively. Due to the discrete nature of the hard-attention mechanism, we design a reinforcement learning […]
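The abstract describes selecting a subset of source tokens for each target token. Below is a minimal, hypothetical sketch of that general idea using a simple top-k selection over attention scores; the paper itself learns the selection with a reinforcement-learning objective, which this sketch does not implement.

```python
import torch
import torch.nn.functional as F

def hard_attention(query, source_states, k=8):
    """query: (d,) decoder state for one target token.
    source_states: (src_len, d) encoder states.
    Builds a context vector from only k selected source tokens."""
    scores = source_states @ query              # (src_len,) relevance of each source token
    k = min(k, source_states.size(0))
    top_scores, top_idx = torch.topk(scores, k) # hard selection: keep only k positions
    weights = F.softmax(top_scores, dim=-1)     # normalise over the selected subset
    context = weights @ source_states[top_idx]  # (d,) context built from the subset
    return context, top_idx

# Toy usage: 50 source tokens, hidden size 512, one decoder query.
src = torch.randn(50, 512)
q = torch.randn(512)
ctx, selected = hard_attention(q, src, k=8)
```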

Cited by 16 publications (8 citation statements)
References 16 publications

Citation statements (ordered by relevance):
“…Gating techniques relying on sampling and straight-through gradient estimators are common (Bengio et al., 2013; Eigen et al., 2013; …). Conditional computation can also be addressed with reinforcement learning (Denoyer and Gallinari, 2014; Indurthi et al., 2019). Memory-augmented neural networks with sparse reads and writes have also been proposed in Rae et al. (2016) as a way to scale Neural Turing Machines (Graves et al., 2014).…”
Section: Related Work (mentioning)
confidence: 99%
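The excerpt above mentions gating techniques that rely on sampling and straight-through gradient estimators. Below is a rough, generic sketch of a straight-through binary gate; it illustrates the technique in general, not the formulation of any specific cited paper.

```python
import torch

def straight_through_gate(logits):
    """Sample hard 0/1 gates in the forward pass while letting gradients
    flow through the underlying probabilities in the backward pass."""
    probs = torch.sigmoid(logits)
    hard = torch.bernoulli(probs)           # discrete sample, not differentiable
    # Forward value equals the hard sample; backward gradient follows probs.
    return hard + probs - probs.detach()

logits = torch.randn(4, requires_grad=True)
gates = straight_through_gate(logits)
gates.sum().backward()                      # gradients reach the gate logits
```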
“…Since the recurrent neural network operates as a left-to-right sequential process, it significantly limits the model's capacity for parallel computation, and the sequential processing of the data can also cause parts of it to be lost. This problem is alleviated by the attention mechanism, which reduces the distance between any two positions in the translated data to 1, so that the current operation does not depend on the results of the preceding sequential operations and the system achieves better parallelism [32].…”
Section: Recurrent Neural Network With Attention Mechanism (mentioning)
confidence: 99%
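The excerpt above argues that attention removes the left-to-right dependency of recurrent networks by reducing the distance between any two positions to 1. A generic scaled dot-product self-attention sketch illustrates the point, with every output computed from every input in a single parallel step (this is a standard illustration, not code from the cited work).

```python
import torch
import torch.nn.functional as F

def self_attention(x):
    """x: (seq_len, d). A single attention step mixes all pairs of positions."""
    d = x.size(-1)
    scores = x @ x.transpose(0, 1) / d ** 0.5  # (seq_len, seq_len) pairwise scores
    weights = F.softmax(scores, dim=-1)        # each row is a distribution over the sequence
    return weights @ x                         # every output position sees every input position

x = torch.randn(10, 64)
out = self_attention(x)                        # computed in parallel, no recurrence needed
```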
“…Similarly, others raise the question of whether 16 attention heads are really necessary to obtain competitive performance. Finally, several recent works address the computational challenge of modeling very long sequences and modify the Transformer architecture with attention operations that reduce time complexity (Shen et al., 2018; Sukhbaatar et al., 2019; Dai et al., 2019; Indurthi et al., 2019; Kitaev et al., 2020).…”
Section: Introduction (mentioning)
confidence: 99%
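The excerpt above refers to attention variants that reduce time complexity for very long sequences. One common idea, sketched below as a generic illustration (not the exact mechanism of any of the cited works), is to restrict each query to a local window of keys instead of the full sequence.

```python
import torch
import torch.nn.functional as F

def local_attention(x, window=16):
    """x: (seq_len, d). Each position attends only to a window around itself,
    so the cost is O(seq_len * window) rather than O(seq_len ** 2)."""
    seq_len, d = x.shape
    outputs = []
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        keys = x[lo:hi]                         # only a local slice of the sequence
        scores = keys @ x[i] / d ** 0.5         # (hi - lo,) scores for this query
        weights = F.softmax(scores, dim=-1)
        outputs.append(weights @ keys)          # local context vector
    return torch.stack(outputs)                 # (seq_len, d)

out = local_attention(torch.randn(1000, 64), window=16)
```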