Published: 2022
DOI: 10.1145/3505244
Transformers in Vision: A Survey

Abstract: Astounding results from Transformer models on natural language tasks have intrigued the vision community to study their application to computer vision problems. Among their salient benefits, Transformers enable modeling of long-range dependencies between input sequence elements and support parallel processing of sequences, in contrast to recurrent networks such as Long Short-Term Memory (LSTM). Unlike convolutional networks, Transformers require minimal inductive biases for their design and…
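The mechanism behind these properties is scaled dot-product self-attention, which relates every sequence element to every other in a single matrix product rather than step by step. The NumPy sketch below is a minimal illustration only; the function and variable names are assumptions, not code from the survey.

```python
# Minimal scaled dot-product self-attention (illustrative sketch; shapes and
# weight names are assumptions, not code from the survey).
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # project the whole sequence at once
    scores = q @ k.T / np.sqrt(k.shape[-1])        # all pairwise interactions: (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True) # row-wise softmax over keys
    return weights @ v                             # every output depends on every input element

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 16))                       # sequence of 8 tokens, d_model = 16
w_q, w_k, w_v = (rng.normal(size=(16, 16)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)             # (8, 16), computed in one parallel pass, no recurrence
```

Because the attention weights are computed for all positions simultaneously, there is no sequential dependency between time steps, which is the parallelism advantage over LSTMs that the abstract highlights.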

Cited by 1,566 publications (622 citation statements)
References 82 publications
“…At this stage, we can conclude that using ResNet as a deep learning backbone does not outperform humans in feature selection. This, however, can possibly be improved by using more sophisticated network architectures and pipelines such as transformers 48 .…”
Section: Discussion (citation type: mentioning)
Confidence: 99%
“…Chaudhari et al. [141] provided a survey of attention models in deep neural networks that concentrates on their application to natural language processing, while our work focuses on computer vision. Three more specific surveys [142][143][144] summarize the development of visual transformers, while our paper reviews attention mechanisms in vision more generally, not just self-attention mechanisms. Wang and Tax [145] presented a survey of attention models in computer vision, but it only considers RNN-based attention models, which form just a part of our survey.…”
Section: Other Surveys (citation type: mentioning)
Confidence: 99%
“…A detailed survey of vision transformers is omitted here, as other recent surveys [142][143][144], [171] comprehensively review the use of transformer methods for visual tasks.…”
Section: Vision Transformers (citation type: mentioning)
Confidence: 99%
“…This architecture was proposed in [18] for NLP tasks, outperforming the recurrent and convolutional models that were state of the art at the time. Recently, several works have been published looking for ways to apply it to computer vision tasks [19]. This architecture relies on a self-attention mechanism, but, instead of using recurrence, transformer models create relationships between all samples in the input sequence, allowing parallelization and better use of modern devices such as TPUs and GPUs.…”
Section: Related Work (citation type: mentioning)
Confidence: 99%
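To make the last point concrete, the sketch below shows one common recipe for feeding images to a transformer: cut the image into non-overlapping patches, flatten each patch, and linearly project it to a token embedding (ViT-style). This is an illustrative assumption for exposition, with hypothetical names and shapes, not the specific pipeline of [19] or of the surveyed paper.

```python
# ViT-style patch embedding sketch (hypothetical names and shapes; not the
# specific method of [19] or of this survey).
import numpy as np

def image_to_tokens(img, patch, w_embed):
    """img: (H, W, C) -> (num_patches, d_model) sequence of patch tokens."""
    h, w, c = img.shape
    assert h % patch == 0 and w % patch == 0
    # split into non-overlapping patch x patch blocks, then flatten each block
    tokens = (img.reshape(h // patch, patch, w // patch, patch, c)
                 .transpose(0, 2, 1, 3, 4)
                 .reshape(-1, patch * patch * c))
    return tokens @ w_embed                        # linear projection to the model width

rng = np.random.default_rng(0)
img = rng.normal(size=(32, 32, 3))                 # a toy "image"
w_embed = rng.normal(size=(4 * 4 * 3, 64))         # 4x4 patches -> 64-dim tokens
seq = image_to_tokens(img, patch=4, w_embed=w_embed)  # (64, 64): 64 patch tokens
# With positional embeddings added, this token sequence feeds a standard
# transformer encoder, whose self-attention relates every patch to every other.
```

Once the image is a token sequence, the encoder treats patches exactly as it would words, which is what lets the self-attention mechanism described in the excerpt relate all input samples in parallel.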