2021 IEEE International Conference on Multimedia and Expo (ICME)
DOI: 10.1109/icme51207.2021.9428100
FFNet-M: Feature Fusion Network with Masks for Multimodal Facial Expression Recognition

Cited by 9 publications (9 citation statements)
References 18 publications
“…Here, we use the gridfit interpolation [17] and projection for each 3D scan to obtain aligned RGB images and depth maps. Then, we apply the surface processing comprising three steps [7], namely outlier removal, hole filling, and noise removal, to improve the data quality.…”
Section: Preprocessing
confidence: 99%
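The preprocessing described in this excerpt (grid interpolation of each 3D scan into a depth map, followed by outlier removal, hole filling, and noise removal) can be sketched as follows. This is a minimal illustration assuming scattered 3D vertices as input; scipy's griddata stands in for the gridfit interpolation [17], and the thresholds, filter sizes, and function names are illustrative, not the authors' implementation.

```python
# Hypothetical sketch: project a 3D face scan onto a regular grid to get a
# depth map, then clean it with outlier removal, hole filling, and noise
# removal. griddata is a stand-in for the gridfit interpolation cited above.
import numpy as np
from scipy.interpolate import griddata
from scipy.ndimage import median_filter

def scan_to_depth_map(points, size=224):
    """Project scattered 3D vertices (N, 3) onto a size x size depth map."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    # Regular grid spanning the face region.
    gx, gy = np.meshgrid(
        np.linspace(x.min(), x.max(), size),
        np.linspace(y.min(), y.max(), size),
    )
    # Interpolate depth values onto the grid (linear; gaps become NaN holes).
    return griddata((x, y), z, (gx, gy), method="linear")

def clean_depth_map(depth, z_thresh=3.0):
    """Outlier removal, hole filling, and noise removal on a depth map."""
    d = depth.copy()
    # 1) Outlier removal: discard depths far from the robust center.
    med, std = np.nanmedian(d), np.nanstd(d)
    d[np.abs(d - med) > z_thresh * std] = np.nan
    # 2) Hole filling: replace NaNs with a median of nearby valid values
    #    (a simple stand-in for the filling step in [7]).
    filled = median_filter(np.nan_to_num(d, nan=med), size=5)
    d = np.where(np.isnan(d), filled, d)
    # 3) Noise removal: light median smoothing of the surface.
    return median_filter(d, size=3)
```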
“…After that, we calculate the average of them to generate the input for ViT. The previous methods try to map a 3D scan into several three-channel pseudo-color images matching the RGB image [20] in order to directly utilize common backbone networks such as VGG16 [21] and ResNet [22] for processing 3D information [4,7]. Therefore, these approaches usually require multi-branch networks with independent parameters to handle different-modal data and fuse them at the feature level.…”
Section: Alternative Fusion Strategy
confidence: 99%
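The multi-branch, feature-level fusion that this excerpt contrasts with its ViT input can be sketched roughly as below. This is a minimal PyTorch sketch assuming ResNet-18 backbones from torchvision in place of the cited VGG16/ResNet variants; the class name, fusion by concatenation, and six expression classes are assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch of multi-branch, feature-level fusion: one backbone per
# modality (RGB image and a pseudo-color depth map), features concatenated
# before classification.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class TwoBranchFusionNet(nn.Module):
    def __init__(self, num_classes=6):
        super().__init__()
        # Independent parameters per modality, as in the cited approaches.
        self.rgb_branch = resnet18(weights=None)
        self.depth_branch = resnet18(weights=None)
        feat_dim = self.rgb_branch.fc.in_features  # 512 for ResNet-18
        # Strip the classifiers so each branch outputs a feature vector.
        self.rgb_branch.fc = nn.Identity()
        self.depth_branch.fc = nn.Identity()
        # Feature-level fusion: concatenate, then classify.
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, rgb, depth):
        f_rgb = self.rgb_branch(rgb)        # (B, 512)
        f_depth = self.depth_branch(depth)  # (B, 512)
        fused = torch.cat([f_rgb, f_depth], dim=1)
        return self.classifier(fused)

# Usage: both inputs are 3-channel images; the depth map is rendered as a
# pseudo-color image so the same kind of backbone can consume it.
model = TwoBranchFusionNet(num_classes=6)
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 3, 224, 224))
```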
“…Jiao et al. [13] proposed FA-CNN to localize discriminative facial parts, but the receptive fields also attend to irrelevant areas such as the forehead, and their distribution is not stable enough, as the heat-map visualizations show. Sui et al. [17] designed masks to directly enhance the local features across the whole salient regions; however, different components make different contributions to the judgment of one expression. For example, the features of the eyes and mouth are more critical than those of the nose.…”
Section: Introduction
confidence: 99%
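A minimal sketch of mask-based local-feature enhancement with per-region importance weights, reflecting the point that the eyes and mouth contribute more than the nose. The region masks, learnable weights, and residual-style enhancement here are assumptions for illustration only, not the FFNet-M or [17] formulation.

```python
# Hypothetical per-region weighted masking of a feature map; region masks and
# learnable importance weights are illustrative assumptions.
import torch
import torch.nn as nn

class WeightedRegionMask(nn.Module):
    def __init__(self, region_masks):
        """region_masks: (R, H, W) binary masks, e.g. eyes, mouth, nose."""
        super().__init__()
        self.register_buffer("masks", region_masks.float())
        # One learnable importance weight per facial region.
        self.region_weights = nn.Parameter(torch.ones(region_masks.shape[0]))

    def forward(self, feat):
        """feat: (B, C, H, W) feature map spatially aligned with the masks."""
        w = torch.softmax(self.region_weights, dim=0)      # (R,)
        # Combine region masks into one spatial enhancement map.
        enh = (w[:, None, None] * self.masks).sum(dim=0)   # (H, W)
        # Enhance salient regions while keeping the original features.
        return feat * (1.0 + enh)
```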