2023
DOI: 10.1109/taffc.2022.3226473
Vision Transformer With Attentive Pooling for Robust Facial Expression Recognition

Cited by 57 publications (9 citation statements)
References 60 publications
“…9. VTFF [28], TransFER [24], Facial Chirality [45], APViT [26], POSTER [25], POSTER++ [27], and ARBEx [47] are transformer-based architectures, whereas RAN [5], SCAN-CCI [9], ARM [44], EAC [43], and DDAMFN [46] are CNN-based architectures.

Methods                 # Params   RAF-DB   FERPlus
VTFF [28]               80.1M      -        88.81
RAN [5]                 11.2M      86.90    89.16
VTFF [28]               51.8M      88.14    -
SCAN-CCI [9]            70M        89.02    89.42
EAC [43]                11.2M      89.99    89.64
ARM [44]                11.2M      90.42    -
TransFER [24]           65.2M      90.91    90.83
Facial Chirality [45]   46.2M      91.20    -
DDAMFN [46]             4.11M      91.35    90.74
APViT [26]              65.2M      91.98    90.86
POSTER [25]             71.8M      92.05    91.62
POSTER++ [27]           43…”
Section: Methods (mentioning)
confidence: 99%
“…1, models with more parameters are not always better. APViT [32] is a recently proposed state-of-the-art method that combines both CNN and ViT for feature extraction. It boosts IR-50 from 30.78 to 35.48.…”
Section: Methods (mentioning)
confidence: 99%
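The statement above summarises APViT's general design: convolutional features feed a transformer, and attentive pooling keeps only the most informative tokens. The PyTorch sketch below is a minimal illustration of that idea, assuming a small convolutional stem in place of the IR-50 backbone and a simple learned per-token score for the pooling step; all module names, dimensions, and the top-k selection rule are illustrative assumptions, not the paper's exact implementation.

```python
# Hypothetical sketch of a CNN + ViT pipeline with attentive token pooling.
# Module names, dimensions, and the top-k rule are assumptions for
# illustration, not the authors' exact implementation.
import torch
import torch.nn as nn


class AttentiveTokenPooling(nn.Module):
    """Keep only the k tokens with the highest learned importance scores."""

    def __init__(self, dim: int, keep: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-token importance score
        self.keep = keep

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        scores = self.score(tokens).squeeze(-1)        # (batch, num_tokens)
        idx = scores.topk(self.keep, dim=1).indices    # top-k token indices
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return tokens.gather(1, idx)                   # (batch, keep, dim)


class HybridFER(nn.Module):
    """CNN stem -> attentive pooling -> transformer encoder -> logits."""

    def __init__(self, num_classes: int = 7, dim: int = 256, keep: int = 49):
        super().__init__()
        # Small CNN stem standing in for the IR-50 backbone mentioned above.
        self.stem = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3),
            nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, kernel_size=3, stride=2, padding=1),
        )
        self.pool = AttentiveTokenPooling(dim, keep)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.stem(images)                  # (batch, dim, H, W)
        tokens = feats.flatten(2).transpose(1, 2)  # (batch, H*W, dim)
        tokens = self.pool(tokens)                 # keep salient tokens only
        tokens = self.encoder(tokens)
        return self.head(tokens.mean(dim=1))       # mean-pool, then classify


logits = HybridFER()(torch.randn(2, 3, 112, 112))  # e.g. 112x112 face crops
print(logits.shape)  # torch.Size([2, 7])
```

Under these assumptions, keeping a fixed number of high-scoring tokens both discards uninformative patches (one plausible source of the robustness to occlusion the paper's title refers to) and shortens the sequence the transformer must process.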
“…Comparison methods were selected among SOTA FER studies specialised for in-the-wild FER: 1) a patch-based attention CNN mechanism (pACNN) [23], 2) adversarial graph representation (AGR) [55], 3) region attention network (RAN) [32], 4) feature decomposition and reconstruction learning (FDRL) [28], 5) EfficientFace [56], 6) deep attentive centre loss (DACL) [20], 7) latent distribution mining and pairwise uncertainty estimation (DMUE) [57], 8) relative uncertainty learning (RUL) [58], 9) FER with visual transformers and feature fusion (VTFF) [59], 10) squeeze vision transformer (Squeeze-ViT) [35], 11) FER through meta learning (Face2Exp) [29], and 12) vision transformer with attentive pooling (APViT) [60]. As shown in Table 1, the proposed method, which combines a face graph with a GCN, showed the best performance on all three datasets except AffectNet-8.…”
Section: A Performance Comparison With State-of-the-Art Approaches (mentioning)
confidence: 99%
“…However, because AffectNet-8 used pre-labelled landmarks to train the model, there was a slight performance degradation compared with the other datasets. APViT [60], which combines CNNs and vision transformers, performed well on both AffectNet-8 and RAF-DB because it was pre-trained on different datasets. However, when it was trained from scratch without pre-training, as our method is, its performance was far worse than that of the proposed method.…”
Section: A Performance Comparison With State-of-the-Art Approaches (mentioning)
confidence: 99%