2021
DOI: 10.48550/arXiv.2109.13086
Preprint

MFEViT: A Robust Lightweight Transformer-based Network for Multimodal 2D+3D Facial Expression Recognition

Hanting Li,
Mingzhe Sui,
Zhaoqing Zhu
et al.

Abstract: The vision transformer (ViT) has been widely applied in many areas thanks to its self-attention mechanism, which provides a global receptive field from the first layer. It even surpasses CNNs on some vision tasks. However, an issue arises when applying vision transformers to 2D+3D facial expression recognition (FER): ViT training requires massive data, yet the number of samples in public 2D+3D FER datasets is far from sufficient. How to utilize the V…
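The abstract's claim that self-attention yields a global receptive field from the first layer can be illustrated with a minimal sketch (not code from the paper): in a single attention step, every output token is a weighted sum over all input tokens, so even layer one mixes information across the whole image. The names and dimensions below are illustrative assumptions.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention over a sequence of patch embeddings.

    Each output row is a softmax-weighted sum over *all* input rows,
    which is why a ViT has a global receptive field from its first layer.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])          # (n_tokens, n_tokens)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
n_tokens, dim = 4, 8                 # e.g. 4 image patches, 8-dim embeddings
x = rng.normal(size=(n_tokens, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)                     # each patch token attends to every other
```

By contrast, a convolution's receptive field grows only gradually with depth, which is the property the abstract contrasts against.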

Cited by 2 publications (2 citation statements)
References 20 publications
“…The most common branch is based on image/video signals, and focuses on learning to extract expression features from variables embedded in the regular (grid-like) space, allowing Euclidean models such as CNNs/SVMs/PCA to extract salient features from spatial/temporal correlations [1, 2]. Methods combining 3D and 2D [3, 4, 5] commonly try to restore a geometrical representation from multi-modal sensing results, e.g., raster images and point cloud scans, and then apply the texture channel as auxiliary salient information to overcome possible perturbations such as isometry in real 3D scenes. For instance, the rigid rotation augmentation scheme called Multi-View Stereo (MVS) is a representative member [6]; however, this branch requires exhaustive representation interpolation, which limits its realistic application and incurs extra exterior noise.…”
Section: Introduction
confidence: 99%
“…For his part, [9] used the WiSARD network, while [10] used Neighborhood Difference Features for feature extraction and random forest for classification. For the analysis and classification of images, recent studies have focused on implementing the vision transformer model [11], [12], [13], [14], [15], [16]. Various studies have also used the FER-2013 database for testing; for example, [17] and [18] used a 2D convolutional network and obtained accuracies of 66% and 94%, respectively.…”
Section: Introduction
confidence: 99%