2021
DOI: 10.1155/2021/5585041
Hierarchical Attention‐Based Multimodal Fusion Network for Video Emotion Recognition

Abstract: Context, such as scenes and objects, plays an important role in video emotion recognition, and recognition accuracy improves further when context information is incorporated. Although previous research has considered context information, the emotional clues contained in different images may differ, which is often ignored. To address the problem of emotion differences between modalities and between images, this paper proposes a hierarchical attention-based multimodal fusio…
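The abstract describes weighting emotional clues that vary across images and across modalities. A minimal illustrative sketch of that idea (not the paper's implementation; the softmax attention form, the two-level pooling, and all variable names here are assumptions) is frame-level attention within each modality followed by modality-level attention over the pooled summaries:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax so attention weights sum to 1.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(features, w):
    """Frame-level attention: score each frame, then take the
    weighted average so more informative frames contribute more."""
    scores = features @ w            # (num_frames,)
    alpha = softmax(scores)          # frame attention weights
    return alpha @ features          # (feat_dim,) pooled summary

def hierarchical_fusion(visual, context, w_v, w_c, w_m):
    """Two-level (hierarchical) fusion: pool frames within each
    modality, then attend over the modality-level vectors."""
    v = attention_pool(visual, w_v)   # visual-stream summary
    c = attention_pool(context, w_c)  # context-stream summary
    modal = np.stack([v, c])          # (2, feat_dim)
    beta = softmax(modal @ w_m)       # modality attention weights
    return beta @ modal               # fused video representation

rng = np.random.default_rng(0)
feat_dim = 8
visual = rng.normal(size=(5, feat_dim))   # 5 frames of visual features
context = rng.normal(size=(3, feat_dim))  # 3 frames of scene/object features
w_v, w_c, w_m = (rng.normal(size=feat_dim) for _ in range(3))
fused = hierarchical_fusion(visual, context, w_v, w_c, w_m)
print(fused.shape)  # (8,)
```

In a trained network the scoring vectors would be learned parameters and the features would come from CNN backbones; here random values stand in purely to show the data flow.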

Cited by 9 publications (12 citation statements)
References 35 publications
“…To evaluate the performance of the proposed method, we conduct experiments on four publicly available video emotion recognition datasets, namely, the MHED dataset [35], the HEIV dataset [36], the VideoEmotion-8 dataset [37], and the Ekman-6 dataset [20].…”
Section: Methodsmentioning
confidence: 99%
“…To evaluate the performance of the proposed method, we conduct experiments on four publicly available video emotion recognition datasets, namely, the MHED dataset [35], the HEIV dataset [36], the VideoEmotion-8 dataset [37], and the Ekman-6 dataset [20]. MHED [35]: the MHED dataset is composed of 1066 videos, with a training set of 638 videos and a testing set of 428 videos.…”
Section: Experiments' Evaluationmentioning
confidence: 99%
“…We conduct experiments on five publicly available video emotion recognition datasets, namely the MHED dataset [28], the HEIV dataset [29], the Ekman-6 dataset [30], the VideoEmotion-8 dataset [31], and the SFEW dataset [32]. The MHED dataset is composed of 1,066 videos that are manually downloaded from the network, with a training set of 638 videos and a testing set of 428 videos.…”
Section: Methodsmentioning
confidence: 99%
“…To this end, recent studies designed neural networks and optimised model parameters [4]. Although previous studies [5][6][7][8][9][10][11][12][13] have achieved promising progress, it is still challenging to analyse the emotions induced by videos.…”
mentioning
confidence: 99%