Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention

Wu, Sitong; Wu, Tianyi; Tan, Haoru; Guo, Guodong

doi:10.48550/arxiv.2112.14000

Cited by 3 publications

(5 citation statements)

References 0 publications

“…Recently, Transformer [50] has attracted the attention of computer vision community due to its success in the field of natural language processing. A series of Transformer-based methods [13,27,56,51,36,18,12,6,57,60,25,42] have been developed for high-level vision tasks, including image classification [36,13,27,44,49], object detection [34,48,36,4,6], segmentation [55,51,16,2], etc. Although vision Transformer has shown its superiority on modeling long-range dependency [13,43], there are still many works demonstrating that the convolution can help Transformer achieve better visual representation [56,58,61,60,25].…”

Section: Vision Transformer (mentioning)

confidence: 99%

“…These artifacts are caused by the window partition mechanism, and this phenomenon suggests that the shifted window mechanism is inefficient to build the cross-window connection. Some works for high-level vision tasks [12,18,57,42] also point out that enhancing the connection among windows can improve the window-based self-attention methods. Based on the above two points, we investigate channel attention in the Transformer-based model and propose an overlapping cross-attention module to better aggregate cross-window information for the window-based SR Transformer.…”

Section: Motivation (mentioning)

confidence: 99%

Activating More Pixels in Image Super-Resolution Transformer

Chen, Wang, Zhou et al. 2022

Preprint

Transformer-based methods have shown impressive performance in low-level vision tasks, such as image super-resolution. However, we find that these networks can only utilize a limited spatial range of input information through attribution analysis. This implies that the potential of Transformer is still not fully exploited in existing networks. In order to activate more input pixels for reconstruction, we propose a novel Hybrid Attention Transformer (HAT). It combines channel attention and self-attention schemes, thus making use of their complementary advantages. Moreover, to better aggregate the cross-window information, we introduce an overlapping cross-attention module to enhance the interaction between neighboring window features. In the training stage, we additionally propose a same-task pre-training strategy to bring further improvement. Extensive experiments show the effectiveness of the proposed modules, and the overall method significantly outperforms the state-of-the-art methods by more than 1dB. Codes and models will be available at https://github.com/chxy95/HAT.

“…Swin Transformer [19] restricts attention to local windows and enhances inter-window information interaction through shift transform, resulting in an efficiency and performance tradeoff. To further expand the receptive field of window attention and improve the performance of the model in dense prediction tasks, CSWin [7] and Pale Transformer [30] use bar window attention for spatial information aggregation.…”

Section: B Spatial Token Mixer (mentioning)

confidence: 99%

“…Most works have focused on elaborating the spatial token mixer for further improvements. Some of them put their efforts into a well-designed attention mechanism by cross-window connection [14], [19], axial window [7], [30], dynamic window [18], [31]. In contrast, convolutional token mixers also gained much success via large kernel [23] and deformable kernel designs [27].…”

Section: Introduction (mentioning)

confidence: 99%

Block-Wisely Supervised Neural Architecture Search With Knowledge Distillation

Peng, Yuan et al. 2020

2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Vision Transformers have shown great potential in computer vision tasks. Most recent works have focused on elaborating the spatial token mixer for performance gains. However, we observe that a well-designed general architecture can significantly improve the performance of the entire backbone, regardless of which spatial token mixer is equipped. In this paper, we propose UniNeXt, an improved general architecture for the vision backbone. To verify its effectiveness, we instantiate the spatial token mixer with various typical and modern designs, including both convolution and attention modules. Compared with the architecture in which they are first proposed, our UniNeXt architecture can steadily boost the performance of all the spatial token mixers, and narrows the performance gap among them. Surprisingly, our UniNeXt equipped with naive local window attention even outperforms the previous state-of-the-art. Interestingly, the ranking of these spatial token mixers also changes under our UniNeXt, suggesting that an excellent spatial token mixer may be stifled due to a suboptimal general architecture, which further shows the importance of the study on the general architecture of vision backbone. All models and codes will be publicly available.

“…Swin Transformer (Liu et al 2021) limited the computation of self-attention to local windows and constructed cross-window connections between two successive blocks. CSwin (Dong et al 2021) and Pale Transformer (Wu et al 2021b) designed cross-shaped windows and Pale-shaped windows respectively. Shuffle Transformer (Huang et al 2021) proposed shuffled windows.…”

Section: Introduction (mentioning)

confidence: 99%

ViT-LSLA: Vision Transformer with Light Self-Limited-Attention

Hechen, Huang, Zhao 2022

Preprint

Transformers have demonstrated a competitive performance across a wide range of vision tasks, while it is very expensive to compute the global self-attention. Many methods limit the range of attention within a local window to reduce computation complexity. However, their approaches cannot save the number of parameters; meanwhile, the self-attention and inner position bias (inside the softmax function) cause each query to focus on similar and close patches. Consequently, this paper presents a light self-limited-attention (LSLA) consisting of a light self-attention mechanism (LSA) to save the computation cost and the number of parameters, and a self-limited-attention mechanism (SLA) to improve the performance. Firstly, the LSA replaces the K (Key) and V (Value) of self-attention with the X (original input). Applying it in vision Transformers, which have an encoder architecture and a self-attention mechanism, can simplify the computation. Secondly, the SLA has a positional information module and a limited-attention module. The former contains a dynamic scale and an inner position bias to adjust the distribution of the self-attention scores and enhance the positional information. The latter uses an outer position bias after the softmax function to limit some large values of attention weights. Finally, a hierarchical Vision Transformer with Light self-Limited-attention (ViT-LSLA) is presented. The experiments show that ViT-LSLA achieves 71.6% top-1 accuracy on IP102 (2.4% absolute improvement of Swin-T); 87.2% top-1 accuracy on Mini-ImageNet (3.7% absolute improvement of Swin-T). Furthermore, it greatly reduces FLOPs (3.5 GFLOPs vs. 4.5 GFLOPs of Swin-T) and parameters (18.9M vs. 27.6M of Swin-T).
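
The LSA idea in the abstract above (keys and values replaced by the raw input X) can be illustrated with a minimal single-head sketch. This is not the authors' implementation: the function name `light_self_attention`, the helper functions, the 1/sqrt(d) scaling, and the toy inputs are all assumptions made for illustration, and the SLA parts (dynamic scale, inner and outer position biases) are left out.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def light_self_attention(x, w_q):
    """Single-head attention where keys and values are the raw input x.

    x:   n tokens, each a list of d features
    w_q: d x d query projection (the only projection in this sketch)
    """
    d = len(x[0])
    q = matmul(x, w_q)                      # queries are still projected
    x_t = [list(col) for col in zip(*x)]    # x transposed stands in for K^T
    scores = matmul(q, x_t)                 # n x n similarity scores
    scale = 1.0 / math.sqrt(d)              # assumed standard scaling
    attn = [softmax([s * scale for s in row]) for row in scores]
    return matmul(attn, x)                  # x also stands in for V

# Toy example (hypothetical values): 3 tokens with 2 features each.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_q = [[1.0, 0.0], [0.0, 1.0]]
out = light_self_attention(x, w_q)
```

Because K and V are taken directly from X, the sketch needs no key or value projection matrices, which is consistent with the parameter savings the abstract reports.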
