2020
DOI: 10.48550/arxiv.2005.00742
Preprint

Hard-Coded Gaussian Attention for Neural Machine Translation

Cited by 8 publications (17 citation statements)
References 15 publications
“…This suggests the power of simple n-gram models may have been underestimated previously, as they are typically trained from scratch, without modern techniques such as pre-training and knowledge distillation. This also echoes a series of recent work that questions the necessity of word order information (Sinha et al, 2021) and self-attention (You et al, 2020). We provide more details and list the inference speed for IMDB and SST-2 in Table 3. We have previously visualized the speed comparison on the IMDB dataset in Fig.…”
Section: Results (supporting)
confidence: 69%
“…One weakness of DANs is that they are restricted in modeling high-level meanings in long-range contexts, as compared to the self-attention operator in Transformers. However, recent studies have shown that large pre-trained Transformers are rather insensitive to word order (Sinha et al, 2021) and that they still work well when the learned self-attention is replaced with hard-coded localized attention (You et al, 2020). Taken together, these studies suggest that on some tasks it may be possible to get competitive results without computationally expensive operations such as self-attention.…”
Section: Introduction (mentioning)
confidence: 99%
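For context, a deep averaging network (DAN) of the kind discussed in the statement above mixes tokens only through an unweighted average of their embeddings before feed-forward layers, which is why it cannot capture word order or long-range interactions. The following is a minimal illustrative sketch; the class name, dimensions, and two-class output are assumptions for the example, not the configuration of any cited paper.

```python
# Minimal sketch of a Deep Averaging Network (DAN): token embeddings are averaged
# (discarding word order and long-range interactions) and the resulting single
# vector is passed through feed-forward layers. Sizes here are illustrative.
import torch
import torch.nn as nn

class DAN(nn.Module):
    def __init__(self, vocab_size: int = 10000, emb_dim: int = 128,
                 hidden_dim: int = 128, num_classes: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.ff = nn.Sequential(
            nn.Linear(emb_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len). The mean over positions is the only place
        # where tokens interact, which is why DANs are insensitive to word order.
        avg = self.embed(token_ids).mean(dim=1)   # (batch, emb_dim)
        return self.ff(avg)                       # (batch, num_classes) logits

logits = DAN()(torch.randint(0, 10000, (4, 32)))  # 4 sequences of 32 token ids
print(logits.shape)                               # torch.Size([4, 2])
```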
“…Motivated by the observation that most learned attention heads trained on Neural Machine Translation (NMT) tasks focus on a local window around the input query position, You et al (2020) replace attention weights in the Transformer encoder and decoder with unparameterized Gaussian distributions. Provided that they retain the learnable cross-attention weights between the encoder and decoder, You et al (2020) see minimal degradation in NMT BLEU scores. Working from similar observations, Raganato et al (2020) find little to no accuracy degradation on NMT tasks when they replace all but one of the attention heads of each attention sublayer in the Transformer encoder with fixed, non-learnable positional patterns; note that, in their setup, the decoder retains all of its learnable parameters.…”
Section: The Power Of Attention To Model Semantic Relationships (mentioning)
confidence: 99%
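To make the hard-coded scheme described above concrete: instead of computing attention weights from learned query and key projections, each query position attends to key positions through a fixed Gaussian over positions centred on (or one step away from) the query index. The sketch below illustrates the idea only; the function name, offset, and standard deviation values are assumptions for the example rather than the exact configuration of You et al (2020).

```python
# Minimal sketch of hard-coded Gaussian attention: attention weights are a fixed
# Gaussian over key positions, centred near the query position, with no learned
# query/key parameters. Offset/std values here are illustrative.
import torch

def hard_coded_gaussian_attention(values: torch.Tensor,
                                  offset: int = 0,
                                  std: float = 1.0) -> torch.Tensor:
    """values: (seq_len, d_model). Returns (seq_len, d_model)."""
    seq_len = values.size(0)
    positions = torch.arange(seq_len, dtype=torch.float32)
    # Centre of the Gaussian for each query position i (e.g. i-1, i, or i+1 per head).
    centres = positions + offset
    # Unnormalised Gaussian log-scores over key positions j for every query position i.
    scores = -((positions.unsqueeze(0) - centres.unsqueeze(1)) ** 2) / (2.0 * std ** 2)
    weights = torch.softmax(scores, dim=-1)   # fixed attention distribution per query
    return weights @ values                   # weighted average of the value vectors

x = torch.randn(7, 16)                                    # 7 tokens, 16-dim states
out = hard_coded_gaussian_attention(x, offset=-1, std=1.0)
print(out.shape)                                          # torch.Size([7, 16])
```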