2020
DOI: 10.48550/arxiv.2009.14794
Preprint

Rethinking Attention with Performers

Abstract: We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can be also used to …
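The abstract's central claim is that the softmax kernel exp(q·k) admits an unbiased estimator built from strictly positive random features, so attention can be computed without ever forming the N×N matrix. The sketch below illustrates that idea in plain NumPy; the function names, the plain Gaussian draw (FAVOR+ additionally orthogonalizes the random projections), and the sanity check are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def positive_random_features(x, omega):
    # phi(x)_j = exp(omega_j . x - ||x||^2 / 2) / sqrt(m); E[phi(q) . phi(k)] = exp(q . k),
    # and every feature is strictly positive, which keeps the estimator stable.
    m = omega.shape[0]
    norm_sq = np.sum(x ** 2, axis=-1, keepdims=True)
    return np.exp(x @ omega.T - norm_sq / 2.0) / np.sqrt(m)

def performer_style_attention(Q, K, V, num_features=256, seed=0):
    # Q, K: (N, d); V: (N, d_v). The usual 1/sqrt(d) softmax temperature is
    # absorbed by rescaling Q and K with d**(-1/4).
    N, d = Q.shape
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((num_features, d))  # i.i.d. Gaussian; FAVOR+ orthogonalizes these
    q_prime = positive_random_features(Q / d ** 0.25, omega)  # (N, m)
    k_prime = positive_random_features(K / d ** 0.25, omega)  # (N, m)
    kv = k_prime.T @ V                          # (m, d_v), computed once: O(N m d) overall
    normalizer = q_prime @ k_prime.sum(axis=0)  # (N,) row sums of the implicit attention matrix
    return (q_prime @ kv) / normalizer[:, None]

# Sanity check against exact softmax attention on a toy example.
rng = np.random.default_rng(1)
Q, K, V = (rng.standard_normal((8, 16)) for _ in range(3))
scores = np.exp(Q @ K.T / np.sqrt(16))
exact = (scores / scores.sum(axis=-1, keepdims=True)) @ V
approx = performer_style_attention(Q, K, V, num_features=4096)
print(np.max(np.abs(exact - approx)))  # error shrinks as num_features grows
```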

Cited by 205 publications (352 citation statements). References 22 publications.
“…More recently, in the context of transformer architectures, a number of approximations have been proposed to reduce the complexity of such computations to be linear in the number of kernel points O(N ). A non-exhaustive list of references include Linformers [52], Performers [53], Nyströformers [54] and Fast Transformers [55].…”
Section: Discussion (mentioning)
confidence: 99%
“…Due to the quadratic computational complexity, the computation of full attention is unaffordable when dealing with long sequence tokens. Therefore, many works design efficient transformers, aiming to reduce computational complexity (Katharopoulos et al, 2020;Choromanski et al, 2020;Lee et al, 2019;Ying et al, 2018). Current efficient transformers can be categorized into three classes.…”
Section: Related Work (mentioning)
confidence: 99%
“…Current efficient transformers can be categorized into three classes. 1) Linear approximate attention (Katharopoulos et al, 2020;Choromanski et al, 2020;Beltagy et al, 2020;Zaheer et al, 2020) approximates the full attention matrix by linearizing the softmax attention and thus can accelerate the computation by first computing the product of keys and values. 2) Inducing point-based linear transformers (Lee et al, 2019;Ying et al, 2018) use learned inducing points with fixed size to compute attention with input tokens, thus can reduce the computation to linear complexity.…”
Section: Related Work (mentioning)
confidence: 99%
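The acceleration described in the citation statement above is just associativity of matrix products once softmax is replaced by a feature map: phi(Q) (phi(K)^T V) equals (phi(Q) phi(K)^T) V but never materializes the N×N attention matrix. A minimal NumPy illustration follows, using the elu+1 feature map of Katharopoulos et al. (2020) as a stand-in (a Performer would instead use the random-feature map sketched earlier); the function names are illustrative assumptions.

```python
import numpy as np

def phi(x):
    # elu(x) + 1, a simple positive feature map (Performers use random features instead).
    return np.where(x > 0, x + 1.0, np.exp(x))

def quadratic_attention(Q, K, V):
    A = phi(Q) @ phi(K).T                 # (N, N): quadratic in sequence length
    return (A @ V) / A.sum(axis=-1, keepdims=True)

def linear_attention(Q, K, V):
    kv = phi(K).T @ V                     # (d, d_v): size independent of N
    z = phi(Q) @ phi(K).sum(axis=0)       # (N,): the same row-sum normalizer
    return (phi(Q) @ kv) / z[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((10, 8)) for _ in range(3))
print(np.allclose(quadratic_attention(Q, K, V),
                  linear_attention(Q, K, V)))  # True: same output, different cost
```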
“…Besides, efficient transformers are proposed, which may reduce the time complexity of self-attention from quadratic to linear (or log-linear). For example, Linformer and Performer (Choromanski et al, 2020) leverage low-rank self-attention; Sparse Transformers (Child et al, 2019) and Big Bird (Zaheer et al, 2020) utilize sparse self-attention; Reformer introduces learnable attention patterns, and Synthesizer (Tay et al, 2021) introduces randomized attention patterns.…”
Section: Related Work (mentioning)
confidence: 99%