An Occam's Razor View on Learning Audiovisual Emotion Recognition with Small Training Sets

Vielzeuf, Valentin; Kervadec, Corentin; Pateux, Stéphane; Lechervy, Alexis; Jurie, Frédéric

doi:10.1145/3242969.3264980

Cited by 32 publications

(26 citation statements)

References 31 publications


“…For an exact and fair comparison, the numerical values of the conventional methods were quoted directly from previous studies [27, 29, 37, 58, 59, 60, 61]. CNN-RNN-based techniques [27, 29, 60] and 2D CNN-based ones [37, 58, 59, 61] were examined. Note that [37] used five-fold cross-validation jointly with a training set and validation set, so it could not be fairly compared with the others.…”

Section: Results (mentioning)

confidence: 99%


Visual Scene-Aware Hybrid and Multi-Modal Feature Aggregation for Facial Expression Recognition

Lee, Kim, Song (2020), Sensors


Facial expression recognition (FER) technology has made considerable progress with the rapid development of deep learning. However, conventional FER techniques are mainly designed and trained for videos that are artificially acquired in a limited environment, so they may not operate robustly on videos acquired in a wild environment suffering from varying illuminations and head poses. In order to solve this problem and improve the ultimate performance of FER, this paper proposes a new architecture that extends a state-of-the-art FER scheme and a multi-modal neural network that can effectively fuse image and landmark information. To this end, we propose three methods. To maximize the performance of the recurrent neural network (RNN) in the previous scheme, we first propose a frame substitution module that replaces the latent features of less important frames with those of important frames based on inter-frame correlation. Second, we propose a method for extracting facial landmark features based on the correlation between frames. Third, we propose a new multi-modal fusion method that effectively fuses video and facial landmark information at the feature level. By applying attention based on the characteristics of each modality to the features of the modality, novel fusion is achieved. Experimental results show that the proposed method provides remarkable performance, with 51.4% accuracy for the wild AFEW dataset, 98.5% accuracy for the CK+ dataset and 81.9% accuracy for the MMI dataset, outperforming the state-of-the-art networks.



“…Among the VSHNN models without the FS module, C3D and SAGRU of type B showed higher performance than the previous methods. For example, its accuracy improved by around 2.87% compared to a SOTA method [61].…”

Section: Results (mentioning)

confidence: 99%


“…Accuracy: CAKE [60] 68.9; DLP-CNN [58] 74.2; Vielzeuf et al. [73] 80; PG-CNN [14] 83 … patch augmentation. The performance of proposed methods achieve higher accuracy or close to the state-of-the-art methods; (4) A benefit of MFMP+ is observed in most in-the-lab expression datasets as they contain a small number of samples.…”

Section: Methods (mentioning)

confidence: 99%

Expression recognition with deep features extracted from holistic and part-based models

Happy, Dantcheva, Brémond (2021), Image and Vision Computing


Facial expression recognition aims to accurately interpret facial muscle movements in affective states (emotions). Previous studies have proposed holistic analysis of the face, as well as the extraction of features pertaining only to specific facial regions, for expression recognition. While classically the latter have shown better performance, we here explore this in the context of deep learning. In particular, this work provides a performance comparison of holistic and part-based deep learning models for expression recognition. In addition, we showcase the effectiveness of skip connections, which allow a network to infer from both low- and high-level feature maps. Our results suggest that holistic models outperform part-based models in the absence of skip connections. Finally, based on our findings, we propose a data augmentation scheme, which we incorporate in a part-based model. The proposed multi-face multi-part (MFMP) model leverages the wide information from part-based data augmentation, where we train the network using the facial parts extracted from different face samples of the same expression class. Extensive experiments on publicly available datasets show a significant improvement of facial expression classification with the proposed MFMP framework.


“…
Method                   Source            Accuracy (RAF-DB)   Accuracy (EmotioNet)
NCMML [8]                SIP (2016)        57.70%              -
Capsnet [4]              arXiv (2017)      76.12%              32.64%
Boosting-POOF [6]        FG (2017)         73.19%              46.27%
MRE-CNN [1]              ICANN (2017)      76.73%              -
VGG16 [5]                CS (2014)         80.96%              45.59%
RC-DLP [9]               CVPR (2017)       84.70%              -
Emotion classifier [10]  ICMI (2018)       80.00%              -
GAN-Inpainting [11]      CVPR (2018)       81.87%              -
DLP-CNN [2]              IEEE TIP (2019)   84.13%              -
FERAtt [7]               arXiv (2019)      -                   48.63%
E2-Capsnet               -                 85.24%              55.91%

Fig. 3 Visualizations of Capsnet [4], VGG16 [5], Boosting-POOF [6], FERAtt [7] and E2-Capsnet on EmotioNet by T-SNE.…”

Section: Methods (mentioning)

confidence: 99%

E2‐capsule neural networks for facial expression recognition using AU‐aware attention

Cao, Yao (2020), IET Image Processing


The capsule neural network is a new and popular technique in deep learning. However, the traditional capsule neural network does not extract features sufficiently before the dynamic routing between the capsules. In this paper, a Double Enhanced Capsule Neural Network (E2-Capsnet) that uses AU-aware attention for facial expression recognition (FER) is proposed. The E2-Capsnet takes advantage of dynamic routing between the capsules, and has two enhancement modules which are beneficial for FER. The first enhancement module is a convolutional neural network with AU-aware attention, which can help focus on the active areas of the expression. The second enhancement module is a capsule neural network with multiple convolutional layers, which enhances the ability of the feature representation. Finally, a squashing function is used to classify the facial expression. We demonstrate the effectiveness of E2-Capsnet on two public benchmark datasets, RAF-DB and EmotioNet. The experimental results show that our E2-Capsnet is superior to the state-of-the-art methods. Our implementation will be publicly available online.
