Efficient Content-Based Sparse Attention with Routing Transformers
Preprint, 2020
DOI: 10.48550/arxiv.2003.05997

Cited by 26 publications (42 citation statements); references 0 publications.

“…A number of methods have been devoted to designing efficient attention implementations. [33,7,16] use sparse matrix with strict constraints for efficient attention computation. Others [9,2,14,43] employ kernel factorization or matrix factorization to reduce the computational overhead.…”
Section: Related Work (mentioning)
confidence: 99%
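The kernel-factorization route this excerpt mentions can be made concrete with a small sketch: if the softmax is replaced by a positive feature map phi, attention can be computed as phi(Q) (phi(K)^T V), so the n x n score matrix is never formed and the cost grows linearly in sequence length. The snippet below is an illustrative NumPy sketch of that family of methods; the elu(x)+1 feature map and the function name are assumptions made for the example, not the implementation of any of the cited works.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Kernel-factorized attention sketch (illustrative only).

    Replaces softmax(Q K^T) V, which needs an n x n matrix, with
    phi(Q) @ (phi(K)^T @ V), which is linear in the sequence length n.
    phi is a simple positive feature map, here elu(x) + 1.
    """
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
    Qp, Kp = phi(Q), phi(K)                   # (n, d) each
    kv = Kp.T @ V                             # (d, d): keys/values summarized once
    normalizer = Qp @ Kp.sum(axis=0) + eps    # (n,): per-query normalization
    return (Qp @ kv) / normalizer[:, None]    # (n, d), no n x n matrix formed

# toy usage
rng = np.random.default_rng(0)
n, d = 512, 64
Q, K, V = (0.1 * rng.standard_normal((n, d)) for _ in range(3))
print(linear_attention(Q, K, V).shape)  # (512, 64)
```

The point of the factorization is that phi(K)^T V is a d x d summary computed once, so doubling the sequence length doubles the cost instead of quadrupling it.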
“…Sparse Attention: A well-known approach addressing the memory bottleneck is utilizing sparsity patterns in the attention matrix - Routing (Roy et al 2020) and Sparse Transformer (Child et al 2019) are examples of such methods. Our solution is different in the sense that it uses full attention - just with shortened sequence length.…”
Section: Related Work (mentioning)
confidence: 99%
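Since this excerpt contrasts fixed sparsity patterns with the content-based routing of the cited paper, a minimal sketch of the routing idea may help: queries and keys are clustered, and each query attends only to the keys that land in its own cluster, so the number of scored pairs drops well below n x n. This is an illustration in the spirit of routing attention, not the paper's implementation; the crude k-means loop, the cluster count, and the function names are assumptions made for the example.

```python
import numpy as np

def routed_attention(X, Wq, Wk, Wv, n_clusters=8, iters=5, seed=0):
    """Content-based sparse attention sketch (illustrative only).

    Queries and keys are assigned to shared clusters with a few k-means
    steps; each query then attends only to keys in its cluster instead of
    to all n positions.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    n, d = Q.shape

    # crude k-means over the concatenated queries and keys -> shared centroids
    rng = np.random.default_rng(seed)
    pts = np.concatenate([Q, K], axis=0)
    centroids = pts[rng.choice(len(pts), n_clusters, replace=False)]
    for _ in range(iters):
        assign = np.argmin(((pts[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(n_clusters):
            members = pts[assign == c]
            if len(members):
                centroids[c] = members.mean(axis=0)
    q_cluster, k_cluster = assign[:n], assign[n:]

    # attention restricted to within-cluster query/key pairs
    out = np.zeros_like(V)
    for c in range(n_clusters):
        qi, ki = np.where(q_cluster == c)[0], np.where(k_cluster == c)[0]
        if len(qi) == 0 or len(ki) == 0:
            continue  # queries with no same-cluster keys keep a zero output here
        scores = Q[qi] @ K[ki].T / np.sqrt(d)            # only |qi| x |ki| scores
        weights = np.exp(scores - scores.max(axis=1, keepdims=True))
        weights /= weights.sum(axis=1, keepdims=True)
        out[qi] = weights @ V[ki]
    return out
```

The Routing Transformer itself refines this with an online clustering scheme and balanced cluster sizes; the sketch keeps only the core idea of content-dependent sparsity.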
“…Due to this limitation, vanilla transformers are infeasible to train on tasks with very long input sequences, for instance on high-resolution images. This issue has been studied extensively and a number of techniques were introduced that modify attention mechanism without changing overall transformer architecture (Child et al 2019; Roy et al 2020; Ren et al 2021). These sparse attention mechanisms reduce the complexity of self-attention, but still force the model to operate on the sequence of the same length as the input.…”
Section: Introduction (mentioning)
confidence: 99%
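To put the complexity reduction mentioned here into numbers: dense self-attention scores n² query-key pairs, while a routing-style scheme with roughly sqrt(n) clusters scores on the order of n^1.5 pairs; the sequence length itself is unchanged, as the excerpt notes. The sqrt(n) cluster count below is an assumption typical of routing-style attention, not a figure quoted from the cited works.

```python
n = 8192                                      # sequence length
dense_pairs = n * n                           # full self-attention: 67,108,864 scores
clusters = round(n ** 0.5)                    # ~sqrt(n) clusters (assumed)
per_cluster = n // clusters                   # roughly equal-sized clusters
routed_pairs = clusters * per_cluster ** 2    # within-cluster scores only: 737,100

print(f"dense:  {dense_pairs:,}")
print(f"routed: {routed_pairs:,}")            # about two orders of magnitude fewer
```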
“…Long Document Reasoning: In real-world scenarios, the question answering system usually needs to read long documents to find the answer. Many transformer variants to resolve the O(n²) attention cost have been proposed including Sparse Attention [Child et al, 2019], Reformer [Kitaev et al, 2020], Routing Transformer [Roy et al, 2020], Longformer [Beltagy et al, 2020], ETC and BigBird [Zaheer et al, 2020]. In the recent long-range arena [Tay et al, 2020], BigBird is reported to achieve the best score among the different variants, which motivates us to use BigBird as our extractive baseline.…”
Section: Related Work (mentioning)
confidence: 99%