2023
DOI: 10.1109/tbc.2022.3215245
FV2ES: A Fully End2End Multimodal System for Fast Yet Effective Video Emotion Recognition Inference

Cited by 11 publications (2 citation statements)
References 31 publications
“…Compared to the two-phase models considered here for evaluation, some recently proposed fully end-to-end models such as the ones in [80], [84], [101] demonstrated improved emotion recognition performance but at the cost of significantly increased model training complexity. Although the COLD fusion framework is not evaluated in such models in this work, its ability to achieve robust multimodal fusion can be extended to fully end-to-end models as well for additional performance gains.…”
Section: B. Categorical Emotion Recognition Results
confidence: 99%
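The contrast this statement draws between two-phase pipelines and fully end-to-end models comes down to where gradients stop. Below is a minimal sketch, assuming PyTorch; MultimodalEmotionNet and all module names are illustrative placeholders, not the COLD fusion framework's or any cited model's actual code. In the two-phase setting the unimodal encoders are frozen and only the fusion head is trained on their outputs; in the end-to-end setting the same head is trained jointly with both encoders, which is what buys the accuracy gain at the training-complexity cost noted above.

import torch
import torch.nn as nn

class MultimodalEmotionNet(nn.Module):
    def __init__(self, visual_enc: nn.Module, audio_enc: nn.Module,
                 feat_dim: int, num_classes: int, end_to_end: bool = True):
        super().__init__()
        self.visual_enc = visual_enc
        self.audio_enc = audio_enc
        if not end_to_end:
            # Two-phase regime: encoders are pre-trained and frozen, so
            # only the fusion head below receives gradient updates.
            for p in self.visual_enc.parameters():
                p.requires_grad = False
            for p in self.audio_enc.parameters():
                p.requires_grad = False
        # Simple late-fusion head over concatenated modality embeddings.
        self.fusion = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, num_classes))

    def forward(self, frames: torch.Tensor, spectrogram: torch.Tensor):
        v = self.visual_enc(frames)          # (batch, feat_dim)
        a = self.audio_enc(spectrogram)      # (batch, feat_dim)
        return self.fusion(torch.cat([v, a], dim=-1))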
“…Zhao et al. [45] proposed MEmoBERT, a novel multimodal transformer-based model pre-trained with self-supervised learning on a large-scale unlabelled movie dataset, and further adopted prompt-based learning to adapt it to downstream multimodal emotion recognition tasks, especially under low-resource conditions. Wei et al. [46] addressed the computation and storage burden caused by the large number of long, high-resolution videos in the 5G and self-media era: they transferred the success of the transformer in vision to the audio modality and introduced a RepVGG-based single-branch inference module for multimodal emotion recognition tasks. Extensive experiments on the IEMOCAP and CMU-MOSEI datasets demonstrate the effectiveness of these methods, but the proposed model does not discriminate well between similar emotional expressions such as happy and neutral. How to distinguish similar emotions and extract fine-grained information remains a problem for future research.…”
Section: Related Work
confidence: 99%
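The RepVGG-based single-branch inference module attributed to Wei et al. [46] rests on structural reparameterization: a block trained with parallel 3x3, 1x1, and identity branches is algebraically collapsed into a single 3x3 convolution, so inference runs one fast branch. A minimal sketch of the standard fusion arithmetic follows, assuming PyTorch; RepBlock and fuse_branch are illustrative names, not the paper's code.

import torch
import torch.nn as nn
import torch.nn.functional as F

def fuse_branch(conv: nn.Conv2d, bn: nn.BatchNorm2d):
    # Fold a BatchNorm layer into the preceding conv's weight and bias.
    std = (bn.running_var + bn.eps).sqrt()
    scale = bn.weight / std                      # per-channel rescaling
    w = conv.weight * scale.reshape(-1, 1, 1, 1)
    b = bn.bias - bn.running_mean * scale
    return w, b

class RepBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn3 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.bnid = nn.BatchNorm2d(channels)     # identity branch
        self.fused = None                        # set by reparameterize()

    def forward(self, x):
        if self.fused is not None:               # single-branch inference
            return F.relu(self.fused(x))
        return F.relu(self.bn3(self.conv3(x)) + self.bn1(self.conv1(x))
                      + self.bnid(x))

    @torch.no_grad()
    def reparameterize(self):
        c = self.conv3.out_channels
        w3, b3 = fuse_branch(self.conv3, self.bn3)
        w1, b1 = fuse_branch(self.conv1, self.bn1)
        w1 = F.pad(w1, [1, 1, 1, 1])             # lift 1x1 kernel to 3x3
        # Express the identity branch as an equivalent 3x3 conv, then
        # fold its BatchNorm the same way.
        wid = torch.zeros(c, c, 3, 3)
        for i in range(c):
            wid[i, i, 1, 1] = 1.0
        std = (self.bnid.running_var + self.bnid.eps).sqrt()
        scale = self.bnid.weight / std
        wid = wid * scale.reshape(-1, 1, 1, 1)
        bid = self.bnid.bias - self.bnid.running_mean * scale
        # Summing the three branch kernels yields one equivalent 3x3 conv.
        self.fused = nn.Conv2d(c, c, 3, padding=1)
        self.fused.weight.copy_(w3 + w1 + wid)
        self.fused.bias.copy_(b3 + b1 + bid)

After training, calling reparameterize() on each block converts the model in place; the single-branch outputs match the multi-branch forward pass up to floating-point error, which is what makes the fast inference path possible without retraining.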