2022
DOI: 10.1007/978-3-031-19806-9_13

Training Vision Transformers with only 2040 Images

Cited by 27 publications (6 citation statements)
References 21 publications
“…The MDC dataset, characterized by an average of approximately 18 visits per patient, presented more challenges in learning single-code-level sequential information compared to the MIMIC-IV dataset, which has an average of about 2.5 visits per patient (figure 4a). This discrepancy could stem from the tendency of transformers to learn global dependencies; capturing local patterns might require additional strategies as well [50,51,52,53]. As a note, even though the random code-swapping task for pre-training (with randomly initialized weights) within the MDC dataset never reached a performance above 50% (figure 4a, red line), it was possible to succeed with this pre-training by initializing its weights from the transformer pre-trained using the CCS method.…”
Section: Discussion (mentioning)
confidence: 99%
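The warm-start described in the excerpt above, re-using weights from a CCS-pre-trained transformer instead of random initialization for the code-swapping task, can be illustrated with a minimal PyTorch sketch. The model class, checkpoint file name, and swap-detection head below are hypothetical stand-ins, not the cited study's actual implementation.

```python
# Hedged sketch only: warm-starting the random code-swapping pre-training from a
# transformer already pre-trained with the CCS objective. The class, file name,
# and head below are illustrative assumptions, not the cited authors' code.
import torch
import torch.nn as nn


class CodeSequenceTransformer(nn.Module):
    """Minimal encoder over medical-code sequences with a swap-detection head."""

    def __init__(self, vocab_size=10000, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.swap_head = nn.Linear(d_model, 2)  # per-code label: swapped or not

    def forward(self, codes):
        h = self.encoder(self.embed(codes))
        return self.swap_head(h)


model = CodeSequenceTransformer()

# Random initialization plateaued near 50% on MDC; instead, load weights from a
# checkpoint pre-trained with the CCS method and continue the swap pre-training.
ccs_state = torch.load("ccs_pretrained.pt", map_location="cpu")  # hypothetical file
model.load_state_dict(ccs_state, strict=False)  # strict=False: task heads may differ

# ...then run the usual random code-swapping pre-training loop on MDC.
```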
“…After the last T-Block, which focuses more on spectral attention, we further enhance the spatial features using the spatial-spectral domain learning (SDL) module [37], whose output is the desired deblurred multispectral image $\tilde{Y} = f_B^{-1}(Y)$. It is quite interesting to notice that numerous recent articles have proposed and successfully demonstrated training the Transformer with only small data [38][39][40][41]. The CODE addresses the challenge of small-data learning using a completely different philosophy.…”
Section: Code-based Small-data Learning Theory (mentioning)
confidence: 99%
“…The CODE addresses the challenge of small-data learning using a completely different philosophy. Simply speaking, typical techniques [38][39][40][41] have to force the deep network to return a good deep solution (as the final solution), while CODE just accepts the weak DE solution. CODE assumes that although the small scale of data results in such a weak solution, the solution itself still contains useful information.…”
Section: Code-based Small-data Learning Theory (mentioning)
confidence: 99%
“…Training ViT on a small dataset from scratch. Very few works have investigated how to train ViT on small datasets [33][34][35][36][37][38]. Compared with CNNs, ViT lacks their built-in inductive bias, and more data are required for ViT to learn this prior.…”
Section: Type-b (mentioning)
confidence: 99%
“…Chen et al. [36] suppressed the noise effect caused by weak attention values and improved the performance of ViT on small datasets. Cao et al. [37] proposed a method called parametric instance discrimination to construct the contrastive loss and improve the feature-extraction performance of ViT trained on small datasets. Li et al. [38] applied a CNN-based teacher model to guide ViT and improve its ability to capture local information.…”
Section: Type-b (mentioning)
confidence: 99%
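The parametric instance discrimination objective attributed to Cao et al. [37] (the paper covered by this report) can be sketched, under assumptions, as a classifier with one learnable weight vector per training image, trained with cross-entropy so that each image is recognized as its own instance. The feature dimension, temperature, backbone, and the 2040-instance count below are illustrative choices, not the paper's exact configuration.

```python
# Hedged sketch of parametric instance discrimination: every training image is
# its own class, scored against a learnable per-instance weight vector.
import torch
import torch.nn as nn
import torch.nn.functional as F


class InstanceDiscriminationHead(nn.Module):
    def __init__(self, feat_dim: int, num_instances: int, temperature: float = 0.07):
        super().__init__()
        # One learnable weight vector per training instance (assumed setup).
        self.classifier = nn.Linear(feat_dim, num_instances, bias=False)
        self.temperature = temperature

    def forward(self, features: torch.Tensor, instance_ids: torch.Tensor) -> torch.Tensor:
        # Cosine-style logits: normalize both features and instance weights.
        feats = F.normalize(features, dim=-1)
        weights = F.normalize(self.classifier.weight, dim=-1)
        logits = feats @ weights.t() / self.temperature
        # Each image must be classified as "itself" (its own instance index).
        return F.cross_entropy(logits, instance_ids)


# Usage with any backbone that outputs one feature vector per image,
# e.g. a small ViT; 2040 instances mirrors the dataset size in the paper title.
head = InstanceDiscriminationHead(feat_dim=384, num_instances=2040)
features = torch.randn(8, 384)               # placeholder backbone outputs for a batch
instance_ids = torch.randint(0, 2040, (8,))  # index of each image in the training set
loss = head(features, instance_ids)
```

With only 2040 training images, such an instance-level head stays small (one weight vector per image), which is presumably part of why this style of self-supervision is workable on small datasets.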