2022
DOI: 10.1007/978-3-031-20053-3_7
Locality Guidance for Improving Vision Transformers on Tiny Datasets

Cited by 33 publications (21 citation statements)
References 36 publications
“…After experiments, we found that training a Multilayer Perceptron (MLP) as the Feed‐Forward Network achieves some results on underwater images, but still not to our satisfaction. Inspired by [50], we modify the MLP module into a Locally‐Enhanced Feed‐Forward (LeFF) module, as shown in Figure 4. We first increase the feature dimension of each token using a 1 × 1 convolution.…”
Section: Methods
confidence: 99%
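The quoted passage describes only the first step of the LeFF module (the 1 × 1 expansion convolution). A minimal NumPy sketch of a full LeFF-style block is given below, assuming the typical three-stage layout (1 × 1 expansion, 3 × 3 depthwise convolution over the restored 2D token grid, 1 × 1 projection back); all weight names and shapes here are illustrative assumptions, not the cited paper's exact implementation:

```python
import numpy as np

def leff(tokens, w_expand, w_dw, w_project):
    """Locally-Enhanced Feed-Forward sketch (assumed layout).

    tokens:    (N, C) sequence of N patch tokens, N a square number
    w_expand:  (C, C_e) 1x1 conv == per-token linear expansion
    w_dw:      (3, 3, C_e) depthwise 3x3 filters, one per channel
    w_project: (C_e, C) 1x1 conv projecting back to C channels
    """
    n, c = tokens.shape
    h = int(np.sqrt(n))
    # Step 1 (from the quote): 1x1 conv increases each token's feature dim
    x = tokens @ w_expand                      # (N, C_e)
    # Reshape the token sequence back into a 2D feature map
    fmap = x.reshape(h, h, -1)                 # (H, W, C_e)
    # Depthwise 3x3 conv re-introduces spatial locality, channel by channel
    padded = np.pad(fmap, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(fmap)
    for i in range(h):
        for j in range(h):
            patch = padded[i:i + 3, j:j + 3, :]        # (3, 3, C_e)
            out[i, j] = np.einsum('ijc,ijc->c', patch, w_dw)
    # 1x1 conv projects back to the original token dimension
    return out.reshape(n, -1) @ w_project      # (N, C)
```

Because a 1 × 1 convolution acts independently on each spatial position, it reduces to a per-token matrix multiply; only the depthwise stage mixes neighboring tokens.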
“…As a result of this issue, these models require a large sample dataset and incur a quadratic computational cost. In addition, the receptive field of ViT models is fixed, capturing the global dependencies of the input data [42,43].…”
Section: Crop-Net
confidence: 99%
“…Training ViT on a small dataset from scratch. Very few works have investigated how to train ViT on small datasets [33][34][35][36][37][38]. Compared with CNNs, ViT lacks their unique inductive bias, so more data are required to force ViT to learn this prior.…”
Section: Type-B
confidence: 99%
“…Cao et al. [37] proposed a method called parametric instance discrimination, which constructs a contrastive loss to improve the feature-extraction performance of ViT trained on small datasets. Li et al. [38] applied a CNN-based teacher model to guide ViT, improving its ability to capture local information.…”
Section: Type-B
confidence: 99%
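The locality guidance described in the last quote has the ViT student imitate a CNN teacher's intermediate features. A hedged sketch of such a feature-imitation objective is below; the alignment matrix `w_align` and the plain MSE loss are assumptions chosen for illustration, not the cited paper's exact formulation:

```python
import numpy as np

def locality_guidance_loss(vit_feats, cnn_feats, w_align):
    """Feature-imitation loss between ViT student and CNN teacher (sketch).

    vit_feats: (N, C_v) intermediate ViT patch tokens (student)
    cnn_feats: (N, C_c) spatially matched CNN features (teacher)
    w_align:   (C_v, C_c) hypothetical linear map into the teacher's space
    """
    # Project student tokens into the teacher's channel dimension
    projected = vit_feats @ w_align            # (N, C_c)
    # MSE pushes the ViT to reproduce the CNN's local feature responses
    return np.mean((projected - cnn_feats) ** 2)
```

In practice such a term would be added, with some weight, to the ordinary classification loss, so the ViT learns locality from the CNN while still fitting the labels.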