2021
DOI: 10.48550/arxiv.2112.14000
Preprint

Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention

Cited by 3 publications (5 citation statements)
References 0 publications

Citation statements (ordered by relevance):
“…Recently, Transformer [50] has attracted the attention of the computer vision community due to its success in the field of natural language processing. A series of Transformer-based methods [13,27,56,51,36,18,12,6,57,60,25,42] have been developed for high-level vision tasks, including image classification [36,13,27,44,49], object detection [34,48,36,4,6], segmentation [55,51,16,2], etc. Although the vision Transformer has shown its superiority in modeling long-range dependency [13,43], many works demonstrate that convolution can help the Transformer achieve better visual representation [56,58,61,60,25].…”
Section: Vision Transformer (mentioning)
confidence: 99%
“…These artifacts are caused by the window partition mechanism, and this phenomenon suggests that the shifted window mechanism is inefficient at building cross-window connections. Some works on high-level vision tasks [12,18,57,42] also point out that enhancing the connections among windows can improve window-based self-attention methods. Based on the above two points, we investigate channel attention in the Transformer-based model and propose an overlapping cross-attention module to better aggregate cross-window information for the window-based SR Transformer.…”
Section: Motivation (mentioning)
confidence: 99%
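For readers unfamiliar with the window partition mechanism referred to above, the snippet below is a minimal, illustrative sketch (PyTorch; the shapes, names, and hyperparameters are assumptions, not code from any cited paper). It shows why window-based self-attention needs an extra cross-window mechanism: attention is computed independently inside each non-overlapping window, so no information crosses window borders on its own.

```python
import torch

def window_partition(x: torch.Tensor, window_size: int) -> torch.Tensor:
    """(B, H, W, C) -> (B * num_windows, window_size**2, C) non-overlapping windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

feat = torch.randn(1, 56, 56, 96)                # toy feature map (B, H, W, C)
windows = window_partition(feat, window_size=7)  # (64, 49, 96): an 8x8 grid of 7x7 windows

# Self-attention runs independently inside each window; no information crosses
# window borders, hence the need for shifted, shuffled, or overlapping windows.
attn = torch.nn.MultiheadAttention(embed_dim=96, num_heads=4, batch_first=True)
out, _ = attn(windows, windows, windows)         # (64, 49, 96)
```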
“…Swin Transformer [19] restricts attention to local windows and enhances inter-window information interaction through a shift transform, resulting in a tradeoff between efficiency and performance. To further expand the receptive field of window attention and improve the performance of the model in dense prediction tasks, CSWin [7] and Pale Transformer [30] use bar window attention for spatial information aggregation.…”
Section: B Spatial Token Mixer (mentioning)
confidence: 99%
“…Most works have focused on elaborating the spatial token mixer for further improvements. Some of them put their efforts into well-designed attention mechanisms based on cross-window connections [14], [19], axial windows [7], [30], and dynamic windows [18], [31]. In contrast, convolutional token mixers have also gained much success via large-kernel [23] and deformable-kernel designs [27].…”
Section: Introduction (mentioning)
confidence: 99%
“…Swin Transformer (Liu et al 2021) limited the computation of self-attention to local windows and constructed cross-window connections between two successive blocks. CSwin (Dong et al 2021) and Pale Transformer (Wu et al 2021b) designed cross-shaped windows and Pale-shaped windows, respectively. Shuffle Transformer (Huang et al 2021) proposed shuffled windows.…”
Section: Introduction (mentioning)
confidence: 99%
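To make the difference between these window shapes concrete, here is a rough, hedged sketch (PyTorch; the function names, shapes, and group sizes are hypothetical, not the authors' implementations). stripe_partition groups contiguous rows, in the spirit of CSWin-style bar windows, while interlaced_partition groups every k-th row, which is closer in spirit to a pale-shaped region; the actual Pale-Shaped Attention interlaces both rows and columns.

```python
import torch

def stripe_partition(x: torch.Tensor, stripe_height: int) -> torch.Tensor:
    """(B, H, W, C) -> (B * H/stripe_height, stripe_height * W, C): contiguous horizontal bars."""
    B, H, W, C = x.shape
    x = x.view(B, H // stripe_height, stripe_height, W, C)
    return x.reshape(-1, stripe_height * W, C)

def interlaced_partition(x: torch.Tensor, num_groups: int) -> torch.Tensor:
    """Group rows j, j + num_groups, j + 2*num_groups, ... into one attention region."""
    B, H, W, C = x.shape
    x = x.view(B, H // num_groups, num_groups, W, C)  # row index = i * num_groups + j
    x = x.permute(0, 2, 1, 3, 4)                      # gather rows that share the offset j
    return x.reshape(-1, (H // num_groups) * W, C)

feat = torch.randn(1, 56, 56, 96)
bars = stripe_partition(feat, stripe_height=7)    # (8, 392, 96): 8 bars of 7 contiguous rows
pales = interlaced_partition(feat, num_groups=7)  # (7, 448, 96): 7 groups of 8 interlaced rows
```

Either partition feeds the same per-region multi-head attention as in the previous sketch; only the grouping of tokens changes, which is what widens the effective receptive field.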