2023
DOI: 10.3390/app13095521

Comparing Vision Transformers and Convolutional Neural Networks for Image Classification: A Literature Review

Abstract: Transformers are models that implement a self-attention mechanism, individually weighting the importance of each part of the input data. Their use in image classification is still somewhat limited: researchers have so far favored Convolutional Neural Networks for image classification, while transformers were targeted mainly at Natural Language Processing (NLP) tasks. Therefore, this paper presents a literature review that shows the differences between Vision Transformers (ViT) and Convolutional Neural …

Cited by 143 publications (32 citation statements)
References 29 publications
“…The deep learning feature extraction is accomplished using a vision transformer based model (ViT-Base) [14]. Different from conventional neural networks, vision transformer utilizes the self-attention mechanism to focus on the most important regions of the target image, based on which the most meaningful features were computed for certain classification or prediction tasks [15]. Since our limited dataset cannot support sufficient fine-tuning or optimization of the ViT model, we directly used the pre-trained ViT model as a fixed feature extractor.…”
Section: Methods
confidence: 99%
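The self-attention mechanism this excerpt relies on can be illustrated with a minimal sketch (single head, random projections in NumPy — the token count and dimensions are illustrative assumptions, not values from the cited papers):

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence of token embeddings.

    x: (n_tokens, d_model) inputs; w_q / w_k / w_v: (d_model, d_head) projections.
    Returns the attended outputs and the attention-weight matrix."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])           # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # softmax: each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 32))   # e.g. 16 image patches, 32-dim embeddings
w_q, w_k, w_v = (rng.normal(size=(32, 8)) for _ in range(3))
out, attn = self_attention(tokens, w_q, w_k, w_v)
print(out.shape, attn.shape)  # (16, 8) (16, 16)
```

The attention matrix is what lets a ViT weight "the most important regions of the target image": row i gives the distribution of patch i's attention over every other patch.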
“…On the other hand, vision transformers do not contain inductive biases. Also, the combination of CNNs and Transformers was applied to image processing [186], which contributed to reducing the consumption of computing resources and training time [187,188]. The main disadvantages of Transformers are the large amounts of computational resources and the long training times they require.…”
Section: Neural Network and Learning Algorithms In The Medical Image ...
confidence: 99%
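The CNN–Transformer combinations mentioned above typically save compute by letting convolutions collapse the image into a small set of patch tokens before any attention is applied. A minimal sketch of that token-reduction step (NumPy; the patch size and embedding width are illustrative assumptions, and the strided-convolution patch embedding is written here as an explicit reshape):

```python
import numpy as np

def conv_patch_embed(img, w, patch=4):
    """Patch embedding (equivalent to a stride-`patch` convolution):
    split the image into non-overlapping patches and project each to a token.

    img: (H, W, C); w: (patch*patch*C, d_model). Returns (n_patches, d_model)."""
    H, W, C = img.shape
    x = img.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * C)
    return x @ w  # one token per patch

rng = np.random.default_rng(1)
img = rng.normal(size=(32, 32, 3))
w = rng.normal(size=(4 * 4 * 3, 64))
tokens = conv_patch_embed(img, w)
print(tokens.shape)  # (64, 64): 8x8 grid -> 64 tokens instead of 1024 pixels
```

Because self-attention cost grows quadratically with the number of tokens, shrinking 1024 pixel positions to 64 patch tokens is where the resource savings the excerpt describes come from.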
“…The Convolution Vision Transformer structure merges the advantages of Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). Convolutional Neural Networks (CNNs) are recognised for their efficiency in processing local features through their convolutional layers, while Vision Transformers (ViTs) excel at capturing global dependencies in an image through self-attention mechanisms (Maurício et al, 2023). The PixelShuffle operation, also known as sub-pixel convolution, is a technique mainly used for upscaling images in super-resolution tasks (Wang et al, 2023b).…”
Section: Introduction
confidence: 99%
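The PixelShuffle operation referenced in this excerpt trades channel depth for spatial resolution by interleaving groups of r² channels into r×r spatial blocks. A minimal NumPy sketch of the rearrangement, following the usual (C·r², H, W) → (C, H·r, W·r) convention (a sketch of the reshuffle only, not of a full super-resolution model):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Sub-pixel convolution rearrangement: (C*r^2, H, W) -> (C, H*r, W*r).

    Output pixel (c, h*r+i, w*r+j) is taken from input channel c*r^2 + i*r + j
    at position (h, w), so each r^2-channel group becomes one r x r block."""
    c_r2, h, w = x.shape
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)      # split channels into (c, i, j)
    x = x.transpose(0, 3, 1, 4, 2)    # -> (c, h, i, w, j)
    return x.reshape(c, h * r, w * r)

x = np.arange(8 * 3 * 3, dtype=float).reshape(8, 3, 3)  # C=2, r=2, 3x3 maps
y = pixel_shuffle(x, 2)
print(y.shape)  # (2, 6, 6): 2x upscaling by consuming 4 channels per output
```

A convolution that emits C·r² channels followed by this reshuffle is the standard way super-resolution networks upscale without transposed convolutions.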