2021
DOI: 10.48550/arxiv.2109.03810
Preprint

Scaled ReLU Matters for Training Vision Transformers

Abstract: Vision transformers (ViTs) have emerged as an alternative design paradigm to convolutional neural networks (CNNs). However, training ViTs is much harder than training CNNs, as it is sensitive to training parameters such as the learning rate, optimizer, and number of warmup epochs. The reasons for this training difficulty are empirically analysed in Xiao et al. (2021), where the authors conjecture that the issue lies with the patchify stem of ViT models and propose that early convolutions help transformers see better. In this paper, we …
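For concreteness, here is a minimal PyTorch sketch of the patchify stem the abstract refers to: a single stride-p, p × p convolution (p = 16 by default) that maps an image directly to a token sequence. The class name and default sizes are illustrative assumptions, not taken from the paper.

import torch
import torch.nn as nn

class PatchifyStem(nn.Module):
    """Standard ViT patchify stem: one stride-p, p x p convolution.

    Hypothetical illustration; the name and defaults are assumptions.
    """
    def __init__(self, in_chans=3, embed_dim=768, patch_size=16):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)    # (B, N, D) token sequence

tokens = PatchifyStem()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])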

Cited by 3 publications (5 citation statements). References 48 publications (62 reference statements).
“…Li et al [82] also demonstrate that the first few layers embed local details. Xiao et al [75] and Wang et al [68] find that introducing an inductive bias, such as a convolution stem, can stabilize training and improve the peak performance of ViTs. Similarly, Dai et al [17] marry convolutions with ViTs, improving the model's generalization ability.…”
Section: Related Work
Mentioning confidence: 99%
“…1) The patchify stem, implemented by a stride-p p × p convolution (p = 16 by default) in the standard ViT, is the key reason for training instability [8]. Recent works show that a convolution stem [40,42] improves training stability and peak performance. 2) Data bias is a critical challenge for person ReID.…”
Section: IBN-based Convolution Stem
Mentioning confidence: 99%
“…From the perspective of model structure, some recent works [8,40,42] have pointed out that an important factor affecting the performance and stability of ViTs is the patchify stem, implemented by a stride-p p × p convolution (p = 16 by default) on the input image. To address this problem, MocoV3 [8] froze the patch projection to train ViTs, while Xiao et al [42] and Wang et al [40] proposed a convolution stem stacked from several convolution, Batch Normalization (BN) [22], and ReLU [32] layers to increase optimization stability and improve performance. Inspired by the success of integrating Instance Normalization (IN) and BN to learn domain-invariant representations in the ReID task [11,30,33], we follow IBN-Net [33] and improve the convolution stem into the IBN-based convolution stem (ICS).…”
Section: Introduction
Mentioning confidence: 99%
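As a rough illustration of the convolution stem this quote describes, the sketch below stacks Conv-BN-ReLU blocks whose overall stride matches the 16 × 16 patchify stem. The channel widths and depth are assumptions for illustration, and the ICS variant mentioned above would further swap some BN layers for Instance Normalization.

import torch
import torch.nn as nn

def conv_stem(in_chans=3, embed_dim=768):
    """Convolution stem sketch: stacked Conv-BN-ReLU blocks.

    Four stride-2 3x3 convolutions give the same overall stride (16)
    as the patchify stem; the widths are illustrative assumptions.
    """
    widths = [embed_dim // 8, embed_dim // 4, embed_dim // 2, embed_dim]
    layers, prev = [], in_chans
    for w in widths:
        layers += [
            nn.Conv2d(prev, w, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(w),   # the ICS variant would use IN on some channels
            nn.ReLU(inplace=True),
        ]
        prev = w
    layers.append(nn.Conv2d(prev, embed_dim, kernel_size=1))  # final projection
    return nn.Sequential(*layers)

feat = conv_stem()(torch.randn(1, 3, 224, 224))
print(feat.shape)  # torch.Size([1, 768, 14, 14]) -> 196 tokens after flattening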
“…ViT directly splits the input image into 16 × 16 non-overlapping patches. A recent study [23] finds that using convolution in the patch embedding provides a higher-quality token sequence and helps the transformer "see better" than a conventional large-stride non-overlapping patch embedding. Therefore, some works [14,26] adopt overlapped patch embedding, e.g., using a 7 × 7 convolution.…”
Section: Patch Embedding
Mentioning confidence: 99%
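And a sketch of the overlapped patch embedding mentioned in this quote: a 7 × 7 convolution whose stride is smaller than its kernel, so neighbouring patches share pixels. The stride, padding, and embedding width here are assumptions for illustration.

import torch
import torch.nn as nn

class OverlappedPatchEmbed(nn.Module):
    """Overlapped patch embedding sketch: kernel > stride, so patches
    overlap. Stride 4, padding 3, and embed_dim are assumed values."""
    def __init__(self, in_chans=3, embed_dim=64, kernel_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=kernel_size,
                              stride=stride, padding=kernel_size // 2)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, D, H/4, W/4), patches overlap
        return x.flatten(2).transpose(1, 2)    # (B, N, D)

tokens = OverlappedPatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 64])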
“…Patch Embedding Many recent works [9,10,24] study the function of the image-to-token mapping, i.e., the patch embedding head.…”
Section: Ablation Studies
Mentioning confidence: 99%