2021
DOI: 10.48550/arxiv.2104.05704
Preprint

Escaping the Big Data Paradigm with Compact Transformers

Abstract: With the rise of Transformers as the standard for language processing, and their advancements in computer vision, along with their unprecedented size and amounts of training data, many have come to believe that they are not suitable for small sets of data. This trend leads to concerns including, but not limited to: limited availability of data in certain scientific domains and the exclusion of those with limited resources from research in the field. In this paper, we dispel the myth that transformers are …

Cited by 98 publications (153 citation statements)
References 40 publications
“…Shuffle Swin Transformer [15] proposes shuffle multi-headed attention to augment the spatial connection between windows. CCT [7] proposes a convolutional tokenizer and compact vision transformers, leading to better performance when training from scratch on smaller datasets, with fewer parameters than ViT. TransCNN [23] also proposes a co-design of convolutions and multi-headed attention to learn hierarchical representations.…”
Section: Related Work (mentioning)
confidence: 99%
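
The convolutional tokenizer mentioned above replaces ViT's non-overlapping patch embedding with a small convolution-and-pooling stack whose flattened feature map becomes the token sequence. A minimal PyTorch sketch of that idea follows; the channel width, kernel size, and pooling settings are illustrative assumptions, not the exact configuration from CCT [7].

import torch
import torch.nn as nn

class ConvTokenizer(nn.Module):
    # Small conv stack in place of ViT's patch embedding; the flattened
    # feature map is used as the token sequence fed to the transformer.
    def __init__(self, in_channels=3, embed_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, embed_dim, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.conv(x)                       # (B, D, H/2, W/2)
        return x.flatten(2).transpose(1, 2)    # (B, N, D) token sequence

tokens = ConvTokenizer()(torch.randn(2, 3, 32, 32))
print(tokens.shape)  # torch.Size([2, 256, 256]) for 32x32 inputs

Because the tokenizer is convolutional, the token count adapts to the input resolution rather than being fixed by a patch grid.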
“…We evaluate the average performance of NFM with different model architectures on CIFAR-10 [36], CIFAR-100 [36], and ImageNet [13]. We use a pre-activated residual network (ResNet) with depth 18 [29] and a compact vision transformer (ViT-lite) with 7 attention layers and 4 heads [28] on small scale tasks. For more challenging and higher dimensional tasks, we consider the performance of wide ResNet-18 [77] and ResNet-50 architectures, respectively.…”
Section: Results (mentioning)
confidence: 99%
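
For scale, the "ViT-Lite" configuration cited above pairs 7 attention layers with 4 heads. The sketch below instantiates an encoder of that depth and head count using PyTorch's generic TransformerEncoder purely to illustrate the parameter budget involved; the embedding width and feed-forward ratio are assumed values, not the authors' implementation.

import torch
import torch.nn as nn

embed_dim, num_heads, depth = 256, 4, 7   # 7 layers / 4 heads from the citation; width assumed
layer = nn.TransformerEncoderLayer(
    d_model=embed_dim, nhead=num_heads,
    dim_feedforward=2 * embed_dim, batch_first=True,
)
encoder = nn.TransformerEncoder(layer, num_layers=depth)

tokens = torch.randn(8, 64, embed_dim)    # (batch, sequence length, embedding)
out = encoder(tokens)                     # same shape: (8, 64, 256)
print(sum(p.numel() for p in encoder.parameters()))  # on the order of a few million parameters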
“…LV-ViT [24] is equipped with a token labeling method. CCT [16] is constructed with convolutions and a sequence pooling strategy. These two works both significantly reduced the number of parameters.…”
Section: Vision Transformer (mentioning)
confidence: 99%
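
The sequence pooling strategy attributed to CCT [16] is generally described as an attention-weighted average of the encoder's output tokens, used in place of a class token before the classifier. A hedged sketch of such a pooling head follows; the single-linear scoring and the dimensions are assumptions based on that description rather than the exact published code.

import torch
import torch.nn as nn

class SeqPool(nn.Module):
    # Attention-weighted average over output tokens, used instead of a class token.
    def __init__(self, embed_dim=256, num_classes=100):
        super().__init__()
        self.score = nn.Linear(embed_dim, 1)         # one importance score per token
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens):                       # tokens: (B, N, D)
        weights = self.score(tokens).softmax(dim=1)  # (B, N, 1), sums to 1 over N
        pooled = (weights * tokens).sum(dim=1)       # (B, D) weighted average
        return self.classifier(pooled)               # (B, num_classes)

logits = SeqPool()(torch.randn(4, 64, 256))
print(logits.shape)  # torch.Size([4, 100])

Dropping the class token this way removes a learned parameter vector and lets every output token contribute to the final prediction, which is one reason cited for the reduced parameter count.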