2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.00060

Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet

Cited by 1,529 publications (479 citation statements)
References 21 publications
“…non-overlapped). However, as pointed out by prior works such as T2T [51], CrossViT [5], and PiT [19], using a stronger but still simple subnetwork for tokenization can further improve performance, especially for smaller models. Thus, we adopt a single convolutional layer with an overlapping kernel and identical stride when generating local tokens, e.g.…”
Section: RSA Regional Tokens
Mentioning, confidence: 99%
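As a rough sketch of the tokenization strategy described in the excerpt above (not the cited papers' exact modules), the snippet below uses a single convolutional layer whose kernel is larger than its stride, so neighbouring tokens share pixels before being flattened into a token sequence. The class name, embedding dimension, and kernel/stride/padding values are illustrative assumptions.

import torch
import torch.nn as nn

class OverlappedTokenizer(nn.Module):
    # Hypothetical module: kernel_size > stride makes the receptive fields
    # of adjacent tokens overlap, unlike ViT's non-overlapping patch split.
    def __init__(self, in_chans=3, embed_dim=96, kernel_size=7, stride=4, padding=3):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size, stride=stride, padding=padding)

    def forward(self, x):                    # x: (B, C, H, W)
        x = self.proj(x)                     # (B, embed_dim, H', W')
        return x.flatten(2).transpose(1, 2)  # (B, H'*W', embed_dim) token sequence

tokens = OverlappedTokenizer()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 3136, 96])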
“…Motivated by this, two main lines of research have been developed to improve ViT. One is to enhance individual components of the vision transformer [51,16,41,23] while still using an isotropic structure (i.e., a fixed token number and channel dimension) like ViT: e.g., T2T-ViT [51] introduces a Tokens-to-Token (T2T) transformation to encode the important local structure of each token instead of the naive tokenization, while CrossViT proposes a dual-path architecture, with each path operating at a different scale, to learn multi-scale features.…”
Section: Related Work
Mentioning, confidence: 99%
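To make the T2T idea in the excerpt concrete, here is a minimal, hedged sketch of a single Tokens-to-Token style step: tokens are reshaped back to a 2D map, re-aggregated with an overlapping soft split (nn.Unfold), and projected, so each new token mixes in local structure while the token count shrinks. The class name, dimensions, and kernel settings are assumptions, not the T2T-ViT reference implementation (which also applies a transformer layer between steps).

import torch
import torch.nn as nn

class T2TStep(nn.Module):
    # Hypothetical single Tokens-to-Token step: overlapping soft split + linear projection.
    def __init__(self, dim_in=64, dim_out=64, kernel_size=3, stride=2, padding=1):
        super().__init__()
        self.unfold = nn.Unfold(kernel_size, padding=padding, stride=stride)
        self.proj = nn.Linear(dim_in * kernel_size * kernel_size, dim_out)

    def forward(self, tokens, h, w):          # tokens: (B, h*w, dim_in)
        b = tokens.size(0)
        x = tokens.transpose(1, 2).reshape(b, -1, h, w)  # back to a 2D feature map
        x = self.unfold(x)                    # (B, dim_in*k*k, L): overlapping neighbourhoods
        return self.proj(x.transpose(1, 2))   # (B, L, dim_out), with L < h*w

out = T2TStep()(torch.randn(2, 56 * 56, 64), 56, 56)
print(out.shape)  # torch.Size([2, 784, 64])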
“…Furthermore, knowledge distillation Hinton et al (2015); Wei et al (2020) from a CNN-based model has been shown to be effective in improving the performance of the vision transformer Touvron et al (2020). Instead of simply regarding image patches as tokens, Yuan et al proposed a tokens-to-token (T2T) method that tokenizes patches while taking image structure into account Yuan et al (2021). The T2T method achieves better accuracy with fewer parameters than the vanilla vision transformer Dosovitskiy et al (2020).…”
Section: Related Work
Mentioning, confidence: 99%
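The excerpt's mention of knowledge distillation from a CNN-based teacher can be illustrated with the standard soft-target loss of Hinton et al. (2015); the sketch below assumes a frozen teacher that only provides logits, and the temperature, weighting, and function name are illustrative rather than the cited works' exact formulation.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Hard-label cross-entropy plus KL divergence to the teacher's softened distribution.
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)   # rescale so the soft-target term stays comparable across temperatures
    return (1 - alpha) * ce + alpha * kd

loss = distillation_loss(torch.randn(8, 1000), torch.randn(8, 1000), torch.randint(0, 1000, (8,)))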
“…Recently, the Transformer neural network architecture was proposed [19] for sequential data, with great success in natural language processing [14,4] and, more recently, vision [5,3,6,21,18]. The attention mechanism is at the core of the Transformer; it readily learns long-range dependencies between any two positions in the input data in the form of an attention map.…”
Section: Learning Architectures for Visual Tasks
Mentioning, confidence: 99%
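For reference, a minimal single-head scaled dot-product attention sketch showing the attention map the excerpt refers to: every position is scored against every other position, which is what provides long-range dependencies. Shapes and names are illustrative only.

import torch
import torch.nn.functional as F

def attention(q, k, v):                                    # q, k, v: (B, N, d)
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5   # (B, N, N) pairwise similarities
    attn = F.softmax(scores, dim=-1)                       # the attention map over all positions
    return attn @ v, attn                                  # re-weighted values plus the map itself

out, attn_map = attention(*(torch.randn(1, 196, 64) for _ in range(3)))
print(attn_map.shape)  # torch.Size([1, 196, 196]): every token attends to every token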