2022
DOI: 10.1609/aaai.v36i3.20176

Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention

Abstract: Recently, Transformers have shown promising performance in various vision tasks. To reduce the quadratic computation complexity caused by global self-attention, various methods constrain the range of attention within a local region to improve its efficiency. Consequently, their receptive fields in a single attention layer are not large enough, resulting in insufficient context modeling. To address this issue, we propose a Pale-Shaped self-Attention (PS-Attention), which performs self-attention within a pal…
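The pale-shaped region can be pictured as a set of interlaced rows and columns of the feature map, so a token reaches across the whole map while attending to far fewer tokens than global self-attention. Below is a minimal single-head PyTorch sketch of that idea; the function name `pale_attention`, the `num_pales` parameter, and the crude handling of row/column intersections are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def pale_attention(x, num_pales=2):
    """x: (B, H, W, C). Each pale is one set of interlaced rows and columns;
    every token attends only to the tokens inside its own pale."""
    B, H, W, C = x.shape
    out = torch.zeros_like(x)
    for p in range(num_pales):
        rows = torch.arange(p, H, num_pales)          # interlaced rows of this pale
        cols = torch.arange(p, W, num_pales)          # interlaced columns of this pale
        row_tok = x[:, rows, :, :].reshape(B, -1, C)  # (B, len(rows) * W, C)
        col_tok = x[:, :, cols, :].reshape(B, -1, C)  # (B, H * len(cols), C)
        tok = torch.cat([row_tok, col_tok], dim=1)    # all tokens of the pale-shaped region
        attn = F.softmax(tok @ tok.transpose(1, 2) / C ** 0.5, dim=-1)
        upd = attn @ tok                              # plain scaled dot-product attention
        n_row = row_tok.shape[1]
        # Scatter the updated tokens back; intersections of rows and columns are
        # simply overwritten here, unlike the real PS-Attention.
        out[:, rows, :, :] = upd[:, :n_row].reshape(B, len(rows), W, C)
        out[:, :, cols, :] = upd[:, n_row:].reshape(B, H, len(cols), C)
    return out
```

In this simplified picture each query attends to roughly 2·H·W / num_pales tokens rather than all H·W, which is where the cost saving over global self-attention comes from.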

Cited by 40 publications (9 citation statements)
References 30 publications

“…Our models are highlighted in gray. Although some recent works [33,37,55] with hybrid architectures or more careful designs of model width and depth achieve better performance than HorNet on ImageNet-1K, we think our models will also benefit from these techniques and achieve better performance. Our models also generalize well to a larger image resolution, larger model sizes and more training data.…”
Section: ImageNet Classification (mentioning)
confidence: 92%
“…Srinivas et al [27] proposed BoTNet, a transformer model that incorporates self-attention into the backbone architecture, and applied it to multiple computer vision tasks including image classification, object detection, and instance segmentation. Wu et al [28] proposed Pale Transformer with pale self-attention (PS-Attention), which performs self-attention within pale-shaped regions. Compared with global self-attention, PS-Attention can significantly reduce computational and memory costs while capturing richer contextual information.…”
Section: Related Work (mentioning)
confidence: 99%
“…Some other methods try to alleviate the computations of self-attention by reducing the number of tokens. One way is to enforce the computation of self-attention to be conducted in a predefined local region, for instance, Swin Transformer [23], Pale Transformer [33], HaloNet [30], and CSWin Transformer [6]. These methods are based on the assumption that image patches located far from each other are not semantically relevant, but this only partially holds true.…”
Section: Related Work (mentioning)
confidence: 99%
“…Accordingly, recent efforts were devoted to the following approaches: (1) Enforce the self-attention to be confined to a neighborhood around each token, so that fewer tokens are involved in updating each token. The methods falling in this category include Swin Transformer [23], Pale Transformer [33], HaloNet [30], and CSWin Transformer [6]. These methods are based on the assumption that tokens spatially far away are not semantically correlated, but this does not always hold true.…”
Section: Introduction (mentioning)
confidence: 99%
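To make the local-region restriction that these two statements describe more concrete, the following is a minimal sketch of Swin-style window attention; the helper name `window_attention` and the window size are illustrative assumptions, not any cited paper's exact code.

```python
import torch
import torch.nn.functional as F

def window_attention(x, window=7):
    """x: (B, H, W, C) with H and W divisible by `window`.
    Attention is computed independently inside each window x window patch."""
    B, H, W, C = x.shape
    # Partition into non-overlapping windows: (B * num_windows, window * window, C).
    x = x.reshape(B, H // window, window, W // window, window, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)
    attn = F.softmax(x @ x.transpose(1, 2) / C ** 0.5, dim=-1)
    x = attn @ x
    # Reverse the window partition back to (B, H, W, C).
    x = x.reshape(B, H // window, W // window, window, window, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
```

Because attention is computed independently inside each window, the cost grows with N·window² instead of N², but a token cannot see outside its window in a single layer, which is exactly the limited receptive field that both quoted passages and the Pale Transformer abstract point to.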