2022
DOI: 10.1007/978-3-031-19806-9_13

Training Vision Transformers with only 2040 Images

Cited by 27 publications (6 citation statements)
References 21 publications
“…The MDC dataset, characterized by an average of approximately 18 visits per patient, presented more challenges in learning single-code-level sequential information compared to the MIMIC-IV dataset, which has an average of about 2.5 visits per patient (figure 4a). This discrepancy could stem from the tendency of transformers to learn global dependencies; capturing local patterns might require additional strategies as well [50,51,52,53]. As a note, even though the random code-swapping task for pre-training (with randomly initialized weights) within the MDC dataset never reached a performance above 50% (figure 4a, red line), it was possible to succeed with this pre-training by initializing its weights from the transformer pre-trained using the CCS method.…”
Section: Discussion (mentioning)
confidence: 99%
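The warm-start described in the excerpt above, re-using weights from a CCS-pre-trained transformer instead of random initialization for the code-swapping task, can be illustrated with a minimal PyTorch sketch. The model class, checkpoint file name, and swap-detection head below are hypothetical stand-ins, not the cited study's actual implementation.

```python
# Hedged sketch only: warm-starting the random code-swapping pre-training from a
# transformer already pre-trained with the CCS objective. The class, file name,
# and head below are illustrative assumptions, not the cited authors' code.
import torch
import torch.nn as nn


class CodeSequenceTransformer(nn.Module):
    """Minimal encoder over medical-code sequences with a swap-detection head."""

    def __init__(self, vocab_size=10000, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.swap_head = nn.Linear(d_model, 2)  # per-code label: swapped or not

    def forward(self, codes):
        h = self.encoder(self.embed(codes))
        return self.swap_head(h)


model = CodeSequenceTransformer()

# Random initialization plateaued near 50% on MDC; instead, load weights from a
# checkpoint pre-trained with the CCS method and continue the swap pre-training.
ccs_state = torch.load("ccs_pretrained.pt", map_location="cpu")  # hypothetical file
model.load_state_dict(ccs_state, strict=False)  # strict=False: task heads may differ

# ...then run the usual random code-swapping pre-training loop on MDC.
```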
“…After the last T-Block, which focuses more on spectral attention, we further enhance the spatial features using the spatial-spectral domain learning (SDL) module [37], whose output is the desired deblurred multispectral image $\tilde{Y} = f_B^{-1}(Y)$. It is quite interesting to notice that numerous recent articles have proposed and successfully demonstrated training the Transformer with only small data [38][39][40][41]. The CODE addresses the challenge of small-data learning using a completely different philosophy.…”
Section: Code-based Small-data Learning Theory (mentioning)
confidence: 99%
“…The CODE addresses the challenge of small-data learning using a completely different philosophy. Simply speaking, typical techniques [38][39][40][41] have to force the deep network to return a good deep solution (as the final solution), while CODE just accepts the weak DE solution. CODE assumes that although the small scale of data results in such a weak solution, the solution itself still contains useful information.…”
Section: Code-based Small-data Learning Theory (mentioning)
confidence: 99%
“…Training ViT on a small dataset from scratch. Very few works have investigated how to train ViT on small datasets [33][34][35][36][37][38]. Compared with CNNs, ViT lacks their built-in inductive bias, and more data are required for ViT to learn this prior.…”
Section: Type-b (mentioning)
confidence: 99%
“…Chen et al. [36] suppressed the noise effect caused by weak attention values and improved the performance of ViT on small datasets. Cao et al. [37] proposed a method called parametric instance discrimination to construct the contrastive loss and improve the feature-extraction performance of ViT trained on small datasets. Li et al. [38] applied a CNN-based teacher model to guide ViT and improve its ability to capture local information.…”
Section: Type-b (mentioning)
confidence: 99%
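The parametric instance discrimination objective attributed to Cao et al. [37] (the paper covered by this report) can be sketched, under assumptions, as a classifier with one learnable weight vector per training image, trained with cross-entropy so that each image is recognized as its own instance. The feature dimension, temperature, backbone, and the 2040-instance count below are illustrative choices, not the paper's exact configuration.

```python
# Hedged sketch of parametric instance discrimination: every training image is
# its own class, scored against a learnable per-instance weight vector.
import torch
import torch.nn as nn
import torch.nn.functional as F


class InstanceDiscriminationHead(nn.Module):
    def __init__(self, feat_dim: int, num_instances: int, temperature: float = 0.07):
        super().__init__()
        # One learnable weight vector per training instance (assumed setup).
        self.classifier = nn.Linear(feat_dim, num_instances, bias=False)
        self.temperature = temperature

    def forward(self, features: torch.Tensor, instance_ids: torch.Tensor) -> torch.Tensor:
        # Cosine-style logits: normalize both features and instance weights.
        feats = F.normalize(features, dim=-1)
        weights = F.normalize(self.classifier.weight, dim=-1)
        logits = feats @ weights.t() / self.temperature
        # Each image must be classified as "itself" (its own instance index).
        return F.cross_entropy(logits, instance_ids)


# Usage with any backbone that outputs one feature vector per image,
# e.g. a small ViT; 2040 instances mirrors the dataset size in the paper title.
head = InstanceDiscriminationHead(feat_dim=384, num_instances=2040)
features = torch.randn(8, 384)               # placeholder backbone outputs for a batch
instance_ids = torch.randint(0, 2040, (8,))  # index of each image in the training set
loss = head(features, instance_ids)
```

With only 2040 training images, such an instance-level head stays small (one weight vector per image), which is presumably part of why this style of self-supervision is workable on small datasets.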