Efficient Content-Based Sparse Attention with Routing Transformers
Preprint, 2020
DOI: 10.48550/arxiv.2003.05997

Cited by 26 publications (42 citation statements); references 0 publications.

“…A number of methods have been devoted to designing efficient attention implementations. [33,7,16] use sparse matrix with strict constraints for efficient attention computation. Others [9,2,14,43] employ kernel factorization or matrix factorization to reduce the computational overhead.…”
Section: Related Work (mentioning)
confidence: 99%
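The kernel-factorization route this excerpt mentions can be made concrete with a small sketch: if the softmax is replaced by a positive feature map phi, attention can be computed as phi(Q) (phi(K)^T V), so the n x n score matrix is never formed and the cost grows linearly in sequence length. The snippet below is an illustrative NumPy sketch of that family of methods; the elu(x)+1 feature map and the function name are assumptions made for the example, not the implementation of any of the cited works.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Kernel-factorized attention sketch (illustrative only).

    Replaces softmax(Q K^T) V, which needs an n x n matrix, with
    phi(Q) @ (phi(K)^T @ V), which is linear in the sequence length n.
    phi is a simple positive feature map, here elu(x) + 1.
    """
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
    Qp, Kp = phi(Q), phi(K)                   # (n, d) each
    kv = Kp.T @ V                             # (d, d): keys/values summarized once
    normalizer = Qp @ Kp.sum(axis=0) + eps    # (n,): per-query normalization
    return (Qp @ kv) / normalizer[:, None]    # (n, d), no n x n matrix formed

# toy usage
rng = np.random.default_rng(0)
n, d = 512, 64
Q, K, V = (0.1 * rng.standard_normal((n, d)) for _ in range(3))
print(linear_attention(Q, K, V).shape)  # (512, 64)
```

The point of the factorization is that phi(K)^T V is a d x d summary computed once, so doubling the sequence length doubles the cost instead of quadrupling it.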
“…Sparse Attention: A well-known approach addressing the memory bottleneck is utilizing sparsity patterns in the attention matrix - Routing (Roy et al 2020) and Sparse Transformer (Child et al 2019) are examples of such methods. Our solution is different in the sense that it uses full attention - just with shortened sequence length.…”
Section: Related Work (mentioning)
confidence: 99%
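Since this excerpt contrasts fixed sparsity patterns with the content-based routing of the cited paper, a minimal sketch of the routing idea may help: queries and keys are clustered, and each query attends only to the keys that land in its own cluster, so the number of scored pairs drops well below n x n. This is an illustration in the spirit of routing attention, not the paper's implementation; the crude k-means loop, the cluster count, and the function names are assumptions made for the example.

```python
import numpy as np

def routed_attention(X, Wq, Wk, Wv, n_clusters=8, iters=5, seed=0):
    """Content-based sparse attention sketch (illustrative only).

    Queries and keys are assigned to shared clusters with a few k-means
    steps; each query then attends only to keys in its cluster instead of
    to all n positions.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    n, d = Q.shape

    # crude k-means over the concatenated queries and keys -> shared centroids
    rng = np.random.default_rng(seed)
    pts = np.concatenate([Q, K], axis=0)
    centroids = pts[rng.choice(len(pts), n_clusters, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((pts[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(n_clusters):
            members = pts[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    q_cluster, k_cluster = assign[:n], assign[n:]

    # attention restricted to within-cluster query/key pairs
    out = np.zeros_like(V)
    for c in range(n_clusters):
        qi, ki = np.where(q_cluster == c)[0], np.where(k_cluster == c)[0]
        if len(qi) == 0 or len(ki) == 0:
            continue  # queries with no same-cluster keys keep a zero output here
        scores = Q[qi] @ K[ki].T / np.sqrt(d)            # only |qi| x |ki| scores
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        out[qi] = weights @ V[ki]
    return out
```

The Routing Transformer itself refines this with an online clustering scheme and balanced cluster sizes; the sketch keeps only the core idea of content-dependent sparsity.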
“…Due to this limitation, vanilla transformers are infeasible to train on tasks with very long input sequences, for instance on high-resolution images. This issue has been studied extensively and a number of techniques were introduced that modify attention mechanism without changing overall transformer architecture (Child et al 2019; Roy et al 2020; Ren et al 2021). These sparse attention mechanisms reduce the complexity of self-attention, but still force the model to operate on the sequence of the same length as the input.…”
Section: Introduction (mentioning)
confidence: 99%
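To put the complexity reduction mentioned here into numbers: dense self-attention scores n² query-key pairs, while a routing-style scheme with roughly sqrt(n) clusters scores on the order of n^1.5 pairs; the sequence length itself is unchanged, as the excerpt notes. The sqrt(n) cluster count below is an assumption typical of routing-style attention, not a figure quoted from the cited works.

```python
n = 8192                                      # sequence length
dense_pairs = n * n                           # full self-attention: 67,108,864 scores
clusters = round(n ** 0.5)                    # ~sqrt(n) clusters (assumed)
per_cluster = n // clusters                   # roughly equal-sized clusters
routed_pairs = clusters * per_cluster ** 2    # within-cluster scores only: 737,100

print(f"dense:  {dense_pairs:,}")
print(f"routed: {routed_pairs:,}")            # about two orders of magnitude fewer
```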
“…Long Document Reasoning: In real-world scenarios, the question answering system usually needs to read long documents to find the answer. Many transformer variants to resolve the O(n²) attention cost have been proposed including Sparse Attention [Child et al, 2019], Reformer [Kitaev et al, 2020], Routing Transformer [Roy et al, 2020], Longformer [Beltagy et al, 2020], ETC and BigBird [Zaheer et al, 2020]. In the recent long-range arena [Tay et al, 2020], BigBird is reported to achieve the best score among the different variants, which motivates us to use BigBird as our extractive baseline.…”
Section: Related Work (mentioning)
confidence: 99%