2022
DOI: 10.3390/s22103729

Facial Expression Recognition Based on Squeeze Vision Transformer

Abstract: In recent image classification approaches, the vision transformer (ViT) has shown excellent performance, surpassing that of convolutional neural networks. A ViT achieves high classification accuracy on natural images because it properly preserves global image features. Conversely, a ViT still has many limitations in facial expression recognition (FER), which requires the detection of subtle changes in expression, because it can lose the local features of the image. Therefore, in this paper, we propose Squeeze ViT, …
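The abstract's point about global features follows from how a ViT tokenizes its input: the image is cut into fixed-size patches and attention operates over those coarse patch tokens, so fine-grained local cues can be diluted. A minimal sketch of that patchification step (the image size, patch size, and channel count here are illustrative assumptions, not values from the paper):

```python
import numpy as np

def patchify(image, patch=16):
    """Split an (H, W, C) image into flattened non-overlapping patches,
    the token sequence a plain ViT attends over globally."""
    h, w, c = image.shape
    patches = image.reshape(h // patch, patch, w // patch, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, patch * patch * c)
    return patches  # shape: (num_patches, patch * patch * c)

img = np.zeros((224, 224, 3))   # illustrative input size
tokens = patchify(img)
print(tokens.shape)             # (196, 768)
```

Each 16x16 patch becomes one token; a subtle expression change confined to a few pixels is averaged into a 768-dimensional patch vector, which is the locality loss the abstract refers to.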

Cited by 28 publications (10 citation statements)
References 35 publications
“…However, with this method, channel attention is used only as a means for global feature extraction, and the experiment is limited to the FER 2013 dataset. Kim et al [35] proposed a squeeze vision transformer, a method to reduce computational complexity by reducing the number of feature dimensions while increasing FER performance. In this method, visual tokens and landmark heatmap-based local tokens are combined such that global and local patch characteristics can be maintained at the same time.…”
Section: FER in the Wild with Attention
confidence: 99%
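The statement above describes combining global visual tokens with landmark-heatmap-based local tokens so that both kinds of patch characteristics are preserved. A hedged sketch of one way such a combination could look (the landmark positions, window size, and feature dimensions are illustrative assumptions, not the paper's actual design):

```python
import numpy as np

def local_tokens(feat_map, landmarks, k=3):
    """Pool a k x k window of the feature map around each landmark
    to form one local token per landmark."""
    d = feat_map.shape[-1]
    toks = []
    for (y, x) in landmarks:
        win = feat_map[max(y - k // 2, 0): y + k // 2 + 1,
                       max(x - k // 2, 0): x + k // 2 + 1]
        toks.append(win.reshape(-1, d).mean(axis=0))
    return np.stack(toks)  # (num_landmarks, d)

feat = np.random.rand(14, 14, 256)       # 14x14 global feature map (assumed)
lm = [(3, 4), (3, 9), (7, 7), (10, 5)]   # eyes, nose, mouth (illustrative)
global_toks = feat.reshape(-1, 256)      # 196 global patch tokens
combined = np.concatenate([global_toks, local_tokens(feat, lm)], axis=0)
print(combined.shape)                    # (200, 256)
```

The transformer then attends over the concatenated sequence, so global context and landmark-centered local detail are available to every layer at once.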
“…A comparison method was selected among SOTA FER studies specialised for wild FER with 1) a patch-based attention CNN mechanism (pACNN) [23], 2) adversarial graph representation (AGR) [55], 3) region attention network (RAN) [32], 4) feature decomposition and reconstruction learning (FDRL) [28], 5) EfficientFace [56], 6) deep attentive centre loss (DACL) [20], 7) latent distribution mining and pairwise uncertainty estimation (DMUE) [57], 8) relative uncertainty learning (RUL) [58], 9) FER with visual transformers with feature fusion (VTFF) [59], 10) squeeze vision transformer (Squeeze-ViT) [35], 11) FER through the meta (Face2Exp) [29], and 12) vision transformer with attention pooling (APViT) [60]. As shown in Table 1, the proposed method combining the face graph and a GCN showed the best performance in all three datasets, except AffectNet-8.…”
Section: A Performance Comparison with State-of-the-Art Approaches
confidence: 99%
“…Influenced by the vision transformer, Xue et al [51] designed the first transformer-based FER network to model long-range dependencies for FER. Kim et al [24] improved the vision transformer (ViT) to combine both global and local features so that ViT can be adapted to FER task.…”
Section: Introduction
confidence: 99%
“…Sangwon Kim et al. proposed the Squeeze ViT method to study facial expression recognition with a squeeze vision transformer. They set the training set (training set + validation set) and the test set of the FER dataset at 80.0% and 20.0%, respectively [8]. Xiangkui Jiang et al. proposed a smoking behavior detection method based on YOLOv5 [9], training on a self-built smoking behavior dataset with a 7:3 training-to-test split.…”
Section: Introduction
confidence: 99%
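The splits mentioned above (80.0%/20.0% for the FER data, 7:3 for the smoking dataset) amount to a simple shuffled partition of the sample list. A minimal sketch, with placeholder file names standing in for the real datasets:

```python
import random

def split(items, train_frac, seed=0):
    """Shuffle items deterministically and cut off the first
    train_frac portion as the training split."""
    rng = random.Random(seed)
    items = items[:]
    rng.shuffle(items)
    cut = int(len(items) * train_frac)
    return items[:cut], items[cut:]

samples = [f"img_{i}.png" for i in range(1000)]   # placeholder sample list
train, test = split(samples, 0.8)                 # 80.0% / 20.0% as in [8]
print(len(train), len(test))                      # 800 200
```

A 7:3 split as in [9] is the same call with `train_frac=0.7`; fixing the seed keeps the partition reproducible across runs.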