2021
DOI: 10.48550/arxiv.2106.06899
Preprint
Memory-efficient Transformers via Top-$k$ Attention

Abstract: Following the success of dot-product attention in Transformers, numerous approximations have been recently proposed to address its quadratic complexity with respect to the input length. While these variants are memory and compute efficient, it is not possible to directly use them with popular pre-trained language models trained using vanilla attention, without an expensive corrective pre-training stage. In this work, we propose a simple yet highly accurate approximation for vanilla attention. We process the qu…
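As a rough illustration of the idea sketched in the abstract, below is a minimal, hypothetical PyTorch sketch of top-k attention: for each query, only the k largest query-key scores are kept before the softmax and all other positions are masked out. The function name and the `topk` parameter are my own assumptions, and for clarity the sketch still materialises the full score matrix, which a memory-efficient implementation would avoid.

```python
import torch

def topk_attention(q, k, v, topk=64):
    # q, k, v: (batch, heads, seq_len, head_dim) -- illustrative shapes only
    scale = q.shape[-1] ** -0.5
    scores = torch.matmul(q, k.transpose(-2, -1)) * scale   # (b, h, Lq, Lk)

    # Keep only the top-k scores per query row; everything else is set to
    # -inf so it receives zero weight after the softmax.
    kth = min(topk, scores.shape[-1])
    topk_vals, _ = scores.topk(kth, dim=-1)
    threshold = topk_vals[..., -1:]                          # k-th largest score
    scores = scores.masked_fill(scores < threshold, float("-inf"))

    weights = torch.softmax(scores, dim=-1)
    return torch.matmul(weights, v)
```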

Cited by 1 publication (2 citation statements)
References 18 publications (34 reference statements)
“…It achieves sparsification by zeroing out input entries smaller than average and provides a training-time modification strategy to enable gradient-based training. This is indeed similar to the broadly adopted top-k selection of SoftMax output, e.g., in attention layers of vision (Wang et al., 2022b; Zhao et al., 2019) and language (Gupta et al., 2021) transformers. In contrast, our MultiMax achieves sparsity and improved multi-modality at the same time without extra hyperparameters.…”
Section: Related Work (supporting)
confidence: 57%
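A hypothetical sketch of the "top-k selection of SoftMax output" pattern mentioned in the statement above, assuming the common formulation in which all but the k largest probabilities are zeroed out and the remainder renormalised; the function name and default `k` are illustrative, not taken from any of the cited works.

```python
import torch

def topk_softmax(logits, k=8, dim=-1):
    probs = torch.softmax(logits, dim=dim)
    kth = min(k, probs.shape[dim])
    # Scatter the k largest probabilities back into a zero tensor.
    topk_vals, topk_idx = probs.topk(kth, dim=dim)
    sparse = torch.zeros_like(probs).scatter(dim, topk_idx, topk_vals)
    # Renormalise so the surviving entries sum to one again.
    return sparse / sparse.sum(dim=dim, keepdim=True)
```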
“…To overcome the issue, previous works have proposed sparse SoftMax alternatives, which make it possible to completely ignore small entries below a threshold. These sparse SoftMax variants have been studied in diverse contexts, e.g., generative modeling (Chen et al., 2021), output activations of multiclass classifiers, and/or attention mechanisms (Peters et al., 2019; Martins & Astudillo, 2016; Gupta et al., 2021).…”
Section: Introduction (mentioning)
confidence: 99%
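For context, here is a minimal forward-pass sketch of sparsemax (Martins & Astudillo, 2016), one of the sparse SoftMax alternatives cited in the statement above: it projects the logits onto the probability simplex, driving small entries exactly to zero. The implementation details are my own illustration, not code from the cited work.

```python
import torch

def sparsemax(z, dim=-1):
    # Sort logits in descending order along `dim`.
    z_sorted, _ = torch.sort(z, dim=dim, descending=True)
    k = torch.arange(1, z.shape[dim] + 1, device=z.device, dtype=z.dtype)
    shape = [1] * z.dim()
    shape[dim] = -1
    k = k.view(shape)                       # broadcastable position index 1..K
    z_cumsum = z_sorted.cumsum(dim)
    # An entry is in the support if 1 + k * z_(k) > cumulative sum up to k.
    support = 1 + k * z_sorted > z_cumsum
    k_z = support.sum(dim=dim, keepdim=True)
    # Threshold tau chosen so that the surviving entries sum to one.
    tau = (z_cumsum.gather(dim, k_z - 1) - 1) / k_z.to(z.dtype)
    return torch.clamp(z - tau, min=0.0)
```

For example, `sparsemax(torch.tensor([2.0, 1.0, 0.1]))` returns `[1.0, 0.0, 0.0]`, whereas a standard softmax would assign nonzero probability to every entry.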