2021
DOI: 10.48550/arxiv.2109.03810
Preprint

Scaled ReLU Matters for Training Vision Transformers

Abstract: Vision transformers (ViTs) have emerged as an alternative design paradigm to convolutional neural networks (CNNs). However, training ViTs is much harder than training CNNs, as it is sensitive to training parameters such as the learning rate, optimizer, and number of warmup epochs. The reasons for this training difficulty are empirically analysed in Xiao et al. (2021), where the authors conjecture that the issue lies with the patchify stem of ViT models and propose that early convolutions help transformers see better. In this paper, we …
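For concreteness, here is a minimal PyTorch sketch of the patchify stem the abstract refers to: a single stride-p, p × p convolution (p = 16 by default) that maps an image directly to a token sequence. The class name and default sizes are illustrative assumptions, not taken from the paper.

import torch
import torch.nn as nn

class PatchifyStem(nn.Module):
    """Standard ViT patchify stem: one stride-p, p x p convolution.

    Hypothetical illustration; the name and defaults are assumptions.
    """
    def __init__(self, in_chans=3, embed_dim=768, patch_size=16):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, D, H/p, W/p)
        return x.flatten(2).transpose(1, 2)    # (B, N, D) token sequence

tokens = PatchifyStem()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 768])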

Cited by 3 publications (5 citation statements). References 48 publications (62 reference statements).
“…Li et al [82] also demonstrate that the first few layers embed local details. Xiao et al [75] and Wang et al [68] find that introducing an inductive bias, such as a convolution stem, can stabilize training and improve the peak performance of ViTs. Similarly, Dai et al [17] marry convolutions with ViTs, improving the model's generalization ability.…”
Section: Related Work
Mentioning confidence: 99%
“…1) The patchify stem, implemented by a stride-p p × p convolution (p = 16 by default) in the standard ViT, is the key reason for training instability [8]. Recent works show that a convolution stem [40,42] improves training stability and peak performance. 2) Data bias is a critical challenge for person ReID.…”
Section: IBN-based Convolution Stem
Mentioning confidence: 99%
“…From the perspective of model structure, some recent works [8,40,42] have pointed out that an important factor affecting the performance and stability of ViTs is the patchify stem, implemented by a stride-p p × p convolution (p = 16 by default) on the input image. To address this problem, MocoV3 [8] froze the patch projection to train ViTs, while Xiao et al [42] and Wang et al [40] proposed a convolution stem stacked from several convolution, Batch Normalization (BN) [22], and ReLU [32] layers to increase optimization stability and improve performance. Inspired by the success of integrating Instance Normalization (IN) and BN to learn domain-invariant representations in the ReID task [11,30,33], we follow IBN-Net [33] and improve the convolution stem into the IBN-based convolution stem (ICS).…”
Section: Introduction
Mentioning confidence: 99%
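As a rough illustration of the convolution stem this quote describes, the sketch below stacks Conv-BN-ReLU blocks whose overall stride matches the 16 × 16 patchify stem. The channel widths and depth are assumptions for illustration, and the ICS variant mentioned above would further swap some BN layers for Instance Normalization.

import torch
import torch.nn as nn

def conv_stem(in_chans=3, embed_dim=768):
    """Convolution stem sketch: stacked Conv-BN-ReLU blocks.

    Four stride-2 3x3 convolutions give the same overall stride (16)
    as the patchify stem; the widths are illustrative assumptions.
    """
    widths = [embed_dim // 8, embed_dim // 4, embed_dim // 2, embed_dim]
    layers, prev = [], in_chans
    for w in widths:
        layers += [
            nn.Conv2d(prev, w, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(w),   # the ICS variant would use IN on some channels
            nn.ReLU(inplace=True),
        ]
        prev = w
    layers.append(nn.Conv2d(prev, embed_dim, kernel_size=1))  # final projection
    return nn.Sequential(*layers)

feat = conv_stem()(torch.randn(1, 3, 224, 224))
print(feat.shape)  # torch.Size([1, 768, 14, 14]) -> 196 tokens after flattening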
“…ViT directly splits the input image into 16 × 16 non-overlapping patches. A recent study [23] finds that using convolution in the patch embedding provides a higher-quality token sequence and helps the transformer "see better" than a conventional large-stride non-overlapping patch embedding. Therefore, some works [14,26] adopt overlapped patch embedding, e.g., using a 7 × 7 convolution.…”
Section: Patch Embedding
Mentioning confidence: 99%
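And a sketch of the overlapped patch embedding mentioned in this quote: a 7 × 7 convolution whose stride is smaller than its kernel, so neighbouring patches share pixels. The stride, padding, and embedding width here are assumptions for illustration.

import torch
import torch.nn as nn

class OverlappedPatchEmbed(nn.Module):
    """Overlapped patch embedding sketch: kernel > stride, so patches
    overlap. Stride 4, padding 3, and embed_dim are assumed values."""
    def __init__(self, in_chans=3, embed_dim=64, kernel_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=kernel_size,
                              stride=stride, padding=kernel_size // 2)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, D, H/4, W/4), patches overlap
        return x.flatten(2).transpose(1, 2)    # (B, N, D)

tokens = OverlappedPatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 64])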
“…Patch Embedding Many recent works [9,10,24] study the function of the image-to-token mapping, i.e., the patch embedding head.…”
Section: Ablation Studies
Mentioning confidence: 99%