A Comprehensive Survey of Transformers for Computer Vision

Jamil, Sonain; Piran, Md. Jalil; Kwon, Oh-Jin

doi:10.3390/drones7050287

Cited by 27 publications

(6 citation statements)

References 158 publications


“…In recent years, Transformer-based approaches have achieved remarkable results in the fields of language processing, computer vision and also forecasting [15,24,25,26].…”

Section: LSTM (mentioning)

confidence: 99%

Prediction of Electricity Generation Using Onshore Wind and Solar Energy in Germany

Walczewski, Wöhrle. 2024. Energies


Renewable energy production is one of the most important strategies to reduce the emission of greenhouse gases. However, wind and solar energy especially depend on time-varying properties of the environment, such as weather. Hence, for the control and stabilization of electricity grids, the accurate forecasting of energy production from renewable energy sources is essential. This study provides an empirical comparison of the forecasting accuracy of electricity generation from renewable energy sources by different deep learning methods, including five different Transformer-based forecasting models based on weather data. The models are compared with the long short-term memory (LSTM) and Autoregressive Integrated Moving Average (ARIMA) models as a baseline. The accuracy of these models is evaluated across diverse forecast periods, and the impact of utilizing selected weather data versus all available data on predictive performance is investigated. Distinct performance patterns emerge among the Transformer-based models, with Autoformer and FEDformer exhibiting suboptimal results for this task, especially when utilizing a comprehensive set of weather parameters. In contrast, the Informer model demonstrates superior predictive capabilities for onshore wind power and photovoltaic (PV) power production. The Informer model consistently performs well in predicting both onshore wind and PV energy. Notably, the LSTM model outperforms all other models across various categories. This research emphasizes the significance of selectively using weather parameters for improved performance compared to employing all parameters and a time reference. We show that the suitability and performance of a prediction model can vary significantly, depending on the specific forecasting task and the data that are provided to the model.



“…RDUNet [28] is a residual dense neural network for image denoising based on a densely connected hierarchical network. Recently, transformer technology has been applied to image denoising [29,30]. Most representatively, swin-transformer UNet for image denoising (SUNet) [31] and swin-transformer-based image restoration (SwinIR) [32] adopt the swin-transformer as the primary module and integrate it into a unique denoising architecture to suppress additive noise.…”

Section: Related Work (mentioning)

confidence: 99%

Multi-Branch Network for Color Image Denoising Using Dilated Convolution and Attention Mechanisms

Duong, Nguyen Thi, Lee et al. 2024. Sensors


Image denoising is regarded as an ill-posed problem in computer vision tasks that removes additive noise from imaging sensors. Recently, several convolution neural network-based image-denoising methods have achieved remarkable advances. However, it is difficult for a simple denoising network to recover aesthetically pleasing images owing to the complexity of image content. Therefore, this study proposes a multi-branch network to improve the performance of the denoising method. First, the proposed network is designed based on a conventional autoencoder to learn multi-level contextual features from input images. Subsequently, we integrate two modules into the network, including the Pyramid Context Module (PCM) and the Residual Bottleneck Attention Module (RBAM), to extract salient information for the training process. More specifically, PCM is applied at the beginning of the network to enlarge the receptive field and successfully address the loss of global information using dilated convolution. Meanwhile, RBAM is inserted into the middle of the encoder and decoder to eliminate degraded features and reduce undesired artifacts. Finally, extensive experimental results prove the superiority of the proposed method over state-of-the-art deep-learning methods in terms of objective and subjective performances.


“…The results show that the main applications of ViTs are as follows: 50% are for image classification, 40% are for object detection, 1% are for segmentation, 1% are for compression, 2% are for super-resolution, 3% are for denoising, and 3% are for anomaly detection [22].…”

Section: Transformer Models In Computer Vision (mentioning)

confidence: 99%

Comparative Analysis of Vision Transformer Models for Facial Emotion Recognition Using Augmented Balanced Datasets

Bobojanov, Kim, Arabboev et al. 2023. Applied Sciences


Facial emotion recognition (FER) has a huge importance in the field of human–machine interface. Given the intricacies of human facial expressions and the inherent variations in images, which are characterized by diverse facial poses and lighting conditions, the task of FER remains a challenging endeavour for computer-based models. Recent advancements have seen vision transformer (ViT) models attain state-of-the-art results across various computer vision tasks, encompassing image classification, object detection, and segmentation. Moreover, one of the most important aspects of creating strong machine learning models is correcting data imbalances. To avoid biased predictions and guarantee reliable findings, it is essential to maintain the distribution equilibrium of the training dataset. In this work, we have chosen two widely used open-source datasets, RAF-DB and FER2013. As well as resolving the imbalance problem, we present a new, balanced dataset, applying data augmentation techniques and cleaning poor-quality images from the FER2013 dataset. We then conduct a comprehensive evaluation of thirteen different ViT models with these three datasets. Our investigation concludes that ViT models present a promising approach for FER tasks. Among these ViT models, Mobile ViT and Tokens-to-Token ViT models appear to be the most effective, followed by PiT and Cross Former models.

