2021
DOI: 10.48550/arxiv.2106.10270
Preprint

How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers

Abstract: Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications, such as image classification, object detection and semantic image segmentation. In comparison to convolutional neural networks, the Vision Transformer's weaker inductive bias is generally found to cause an increased reliance on model regularization or data augmentation ("AugReg" for short) when training on smaller training datasets. We conduct a systematic empirical study in order to better…
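To make the "AugReg" idea from the abstract concrete, here is a minimal, hedged PyTorch/timm sketch of that style of recipe: stronger data augmentation (RandAugment, Mixup/CutMix) combined with explicit model regularization (dropout, stochastic depth, weight decay). The model name and every hyperparameter below are illustrative placeholders, not the settings reported in the paper.

```python
# Illustrative AugReg-style training setup (not the paper's exact recipe).
import timm
import torch
from timm.data import Mixup, create_transform

# ViT-S/16 with dropout and stochastic depth ("drop path") enabled.
model = timm.create_model(
    "vit_small_patch16_224",
    pretrained=False,
    num_classes=1000,
    drop_rate=0.1,       # dropout (placeholder value)
    drop_path_rate=0.1,  # stochastic depth (placeholder value)
)

# Training-time augmentation pipeline with RandAugment.
train_transform = create_transform(
    input_size=224,
    is_training=True,
    auto_augment="rand-m9-mstd0.5",  # RandAugment, magnitude 9
)

# Mixup / CutMix applied at the batch level (timm's Mixup expects an even batch size).
mixup_fn = Mixup(
    mixup_alpha=0.2,
    cutmix_alpha=1.0,
    label_smoothing=0.1,
    num_classes=1000,
)

# Decoupled weight decay is the main explicit weight regularizer here.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.1)

def train_step(images: torch.Tensor, targets: torch.Tensor) -> float:
    """One training step of the sketch; targets are integer class indices."""
    images, soft_targets = mixup_fn(images, targets)  # mixed inputs, soft labels
    logits = model(images)
    # Soft-target cross entropy, since Mixup produces non-one-hot labels.
    loss = torch.sum(-soft_targets * torch.log_softmax(logits, dim=-1), dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice `train_transform` would be attached to the training dataset, and the augmentation and regularization strengths would be tuned jointly with the amount of training data, in line with the abstract's point that reliance on AugReg grows as the training set shrinks.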

Cited by 82 publications (121 citation statements)
References 22 publications
“…For the backbone of each view, we consider five ViT variants, "Tiny", "Small", "Base", "Large", and "Huge". Their settings strictly follow the ones defined in BERT [17] and ViT [18,60], i.e. the number of transformer layers, attention heads, and hidden dimensions.…”
Section: Methods (mentioning)
Confidence: 99%
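For reference, the "Tiny" through "Huge" variants named in this excerpt are conventionally parameterized as below. The values are the standard ViT configurations reported in the ViT line of work, collected here as a small Python reference table; the field names are my own and this is not code from the citing paper.

```python
# Standard ViT variant configurations (depth = transformer layers,
# hidden_dim = token embedding size, mlp_dim = 4 * hidden_dim,
# num_heads = attention heads). Values as commonly reported in the literature.
VIT_CONFIGS = {
    "Tiny":  dict(depth=12, hidden_dim=192,  mlp_dim=768,  num_heads=3),
    "Small": dict(depth=12, hidden_dim=384,  mlp_dim=1536, num_heads=6),
    "Base":  dict(depth=12, hidden_dim=768,  mlp_dim=3072, num_heads=12),
    "Large": dict(depth=24, hidden_dim=1024, mlp_dim=4096, num_heads=16),
    "Huge":  dict(depth=32, hidden_dim=1280, mlp_dim=5120, num_heads=16),
}
```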
“…All model variants use the same global encoder which follows the "Base" architecture, except that the number of heads is set to 8 instead of 12. The reason is that the hidden dimension of the tokens must be divisible by the number of heads for multi-head attention, and the hidden dimensions of all standard transformer architectures (from "Tiny" to "Huge" [18,59]) are divisible by 8.…”
Section: Methods (mentioning)
Confidence: 99%
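The divisibility argument in this excerpt is easy to check directly: multi-head attention splits the hidden dimension evenly across heads, and every standard ViT hidden size is divisible by 8, so an 8-head global encoder fits all variants. A small self-contained check (hidden sizes as commonly reported; names are illustrative):

```python
# Verify that 8 heads evenly divides the hidden dimension of every
# standard ViT variant, which is the constraint cited in the excerpt.
HIDDEN_DIMS = {"Tiny": 192, "Small": 384, "Base": 768, "Large": 1024, "Huge": 1280}
NUM_HEADS = 8

for name, dim in HIDDEN_DIMS.items():
    head_dim, remainder = divmod(dim, NUM_HEADS)
    assert remainder == 0, f"{name}: {dim} not divisible by {NUM_HEADS}"
    print(f"{name:5s}: hidden {dim:4d} -> {NUM_HEADS} heads x {head_dim} dims per head")
```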
“…Although ViTs have attained competitive performance on vision tasks, they are known to be more difficult to train than CNNs (Steiner et al., 2021). In ViTs, only multi-layer perceptron (MLP) layers operate locally and are translationally equivariant, while the self-attention layers (Vaswani et al., 2017) operate globally (Dosovitskiy et al., 2021).…”
Section: Introduction (mentioning)
Confidence: 99%
“…In ViTs, only multi-layer perceptron (MLP) layers operate locally and are translationally equivariant, while the self-attention layers (Vaswani et al., 2017) operate globally (Dosovitskiy et al., 2021). As such, ViTs are thought to have weaker inductive biases than CNNs, thus requiring more data, augmentation, and/or regularization than similarly-sized CNNs (Steiner et al., 2021). However, the strategies for data augmentation for training ViTs have largely been adapted from the techniques used for CNNs.…”
Section: Introduction (mentioning)
Confidence: 99%
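The local-versus-global contrast drawn in these excerpts is visible in an ordinary pre-norm ViT encoder block: the MLP is applied to each token independently (and shared across positions), while self-attention mixes information across all tokens. The following PyTorch sketch is illustrative only; the dimensions are arbitrary and it is not code from either paper.

```python
import torch
from torch import nn

class EncoderBlock(nn.Module):
    """Minimal pre-norm ViT-style encoder block (illustrative)."""

    def __init__(self, dim: int = 384, num_heads: int = 6, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Global: every token can attend to every other token.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # Local: the same two-layer MLP acts on each token independently.
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, tokens, dim)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]  # global token mixing
        x = x + self.mlp(self.norm2(x))                    # per-token MLP
        return x

tokens = torch.randn(2, 197, 384)  # e.g. 196 patch tokens + 1 class token
print(EncoderBlock()(tokens).shape)  # torch.Size([2, 197, 384])
```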