2020
DOI: 10.48550/arxiv.2005.00742
Preprint

Hard-Coded Gaussian Attention for Neural Machine Translation

Cited by 8 publications (17 citation statements)
References 15 publications
“…This suggests the power of simple n-gram models may have been underestimated previously, as they are typically trained from scratch, without modern techniques such as pre-training and knowledge distillation. This also echoes a series of recent work that questions the necessity of word order information (Sinha et al, 2021) and self-attention (You et al, 2020). We provide more details and list the inference speed for IMDB and SST-2 in Table 3. We have previously visualized the speed comparison on the IMDB dataset in Fig.…”
Section: Results (supporting)
confidence: 69%
“…One weakness of DANs is that they are restricted in modeling high-level meanings in long-range contexts, as compared to the self-attention operator in Transformers. However, recent studies have shown that large pre-trained Transformers are rather insensitive to word order (Sinha et al, 2021) and that they still work well when the learned self-attention is replaced with hard-coded localized attention (You et al, 2020). Taken together, these studies suggest that on some tasks it may be possible to get competitive results without computationally expensive operations such as self-attention.…”
Section: Introduction (mentioning)
confidence: 99%
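For context, a deep averaging network (DAN) of the kind discussed in the statement above mixes tokens only through an unweighted average of their embeddings before feed-forward layers, which is why it cannot capture word order or long-range interactions. The following is a minimal illustrative sketch; the class name, dimensions, and two-class output are assumptions for the example, not the configuration of any cited paper.

```python
# Minimal sketch of a Deep Averaging Network (DAN): token embeddings are averaged
# (discarding word order and long-range interactions) and the resulting single
# vector is passed through feed-forward layers. Sizes here are illustrative.
import torch
import torch.nn as nn

class DAN(nn.Module):
    def __init__(self, vocab_size: int = 10000, emb_dim: int = 128,
                 hidden_dim: int = 128, num_classes: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(emb_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len). The mean over positions is the only place
        # where tokens interact, which is why DANs are insensitive to word order.
        avg = self.embed(token_ids).mean(dim=1)   # (batch, emb_dim)
        return self.ff(avg)                       # (batch, num_classes) logits

logits = DAN()(torch.randint(0, 10000, (4, 32)))  # 4 sequences of 32 token ids
print(logits.shape)                               # torch.Size([4, 2])
```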
“…Motivated by the observation that most learned attention heads trained on Neural Machine Translation (NMT) tasks focus on a local window around the input query position, You et al (2020) replace attention weights in the Transformer encoder and decoder with unparameterized Gaussian distributions. Provided that they retain the learnable cross-attention weights between the encoder and decoder, You et al (2020) see minimal degradation in NMT BLEU scores. Working from similar observations, Raganato et al (2020) find little to no accuracy degradation on NMT tasks when they replace all but one of the attention heads of each attention sublayer in the Transformer encoder with fixed, non-learnable positional patterns; note that, in their setup, the decoder retains all of its learnable parameters.…”
Section: The Power Of Attention To Model Semantic Relationships (mentioning)
confidence: 99%
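To make the hard-coded scheme described above concrete: instead of computing attention weights from learned query and key projections, each query position attends to key positions through a fixed Gaussian over positions centred on (or one step away from) the query index. The sketch below illustrates the idea only; the function name, offset, and standard deviation values are assumptions for the example rather than the exact configuration of You et al (2020).

```python
# Minimal sketch of hard-coded Gaussian attention: attention weights are a fixed
# Gaussian over key positions, centred near the query position, with no learned
# query/key parameters. Offset/std values here are illustrative.
import torch

def hard_coded_gaussian_attention(values: torch.Tensor,
                                  offset: int = 0,
                                  std: float = 1.0) -> torch.Tensor:
    """values: (seq_len, d_model). Returns (seq_len, d_model)."""
    seq_len = values.size(0)
    positions = torch.arange(seq_len, dtype=torch.float32)
    # Centre of the Gaussian for each query position i (e.g. i-1, i, or i+1 per head).
    centres = positions + offset
    # Unnormalised Gaussian log-scores over key positions j for every query position i.
    scores = -((positions.unsqueeze(0) - centres.unsqueeze(1)) ** 2) / (2.0 * std ** 2)
    weights = torch.softmax(scores, dim=-1)   # fixed attention distribution per query
    return weights @ values                   # weighted average of the value vectors

x = torch.randn(7, 16)                                    # 7 tokens, 16-dim states
out = hard_coded_gaussian_attention(x, offset=-1, std=1.0)
print(out.shape)                                          # torch.Size([7, 16])
```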