2020
DOI: 10.1609/aaai.v34i02.5492
M3ER: Multiplicative Multimodal Emotion Recognition using Facial, Textual, and Speech Cues

Abstract: We present M3ER, a learning-based method for emotion recognition from multiple input modalities. Our approach combines cues from multiple co-occurring modalities (such as face, text, and speech) and is also more robust than other methods to sensor noise in any of the individual modalities. M3ER models a novel, data-driven multiplicative fusion method to combine the modalities, which learns to emphasize the more reliable cues and suppress the others on a per-sample basis. By introducing a check step which uses Canon…
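For intuition, the multiplicative fusion the abstract describes can be sketched in a few lines. The following is a minimal PyTorch illustration, not the authors' implementation: per-modality classifier heads are trained jointly, and each modality's cross-entropy term is scaled by a weight computed from the other modalities' confidence on the true class, so the more reliable cues are emphasized per sample. The head architecture, beta value, and feature dimensions are assumptions, and M3ER's exact modified loss differs from this plain multiplicative-combination scheme.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiplicativeFusion(nn.Module):
    """Sketch of multiplicative fusion over per-modality classifiers."""

    def __init__(self, feat_dims, num_classes, beta=2.0):
        super().__init__()
        # One linear classifier head per modality (face, text, speech, ...).
        self.heads = nn.ModuleList([nn.Linear(d, num_classes) for d in feat_dims])
        self.beta = beta

    def forward(self, feats):
        # feats: list of per-modality feature tensors, each (batch, feat_dim).
        return [head(f) for head, f in zip(self.heads, feats)]

    def loss(self, logits_list, target):
        # target: (batch,) integer class labels.
        probs = [F.softmax(l, dim=-1) for l in logits_list]
        m = len(probs)
        total = 0.0
        for i in range(m):
            # True-class confidence of every *other* modality: (m-1, batch).
            others = torch.stack([
                probs[j].gather(1, target.unsqueeze(1)).squeeze(1)
                for j in range(m) if j != i
            ])
            # Down-weight modality i when the others already classify the
            # sample confidently: weight = prod_j (1 - p_j)^(beta/(m-1)).
            weight = (1.0 - others).prod(dim=0) ** (self.beta / (m - 1))
            ce = F.cross_entropy(logits_list[i], target, reduction="none")
            # detach() keeps the reliability weights out of the gradient path.
            total = total + (weight.detach() * ce).mean()
        return total

At test time the per-modality posteriors can then be combined, for example by averaging, to produce the fused prediction.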

Cited by 201 publications (111 citation statements: 1 supporting, 110 mentioning, 0 contrasting). References 24 publications.
“…"A", "T", and "V" refer to the audio, text, and video modalities respectively. Results from [6] are not included since the test setting was not clear. Results from [13] and [32] were not obtained using leave-one-speaker-out 10-fold CV and thus not directly comparable.…”
Section: Use Of Asr Transcriptionsmentioning
confidence: 99%
“…Although significant progress has been made [1][2][3], AER is still a challenging research problem since human emotions are inherently complex, ambiguous, and highly personal. Humans often express their emotions using multiple simultaneous approaches, such as voice characteristics, linguistic content, facial expressions, and body actions, which makes AER by nature a complex multimodal task [4][5][6]. Furthermore, due to the difficulties in data collection, publicly available datasets often do not have enough speakers to properly cover personal variations in emotion expression.…”
Section: Introduction (mentioning)
confidence: 99%
“…Studies focusing on multimodal fusion experiment on multimodal data to improve the accuracy of emotion recognition, but fall short of empirical evidence for the effectiveness of these models when some modalities are unavailable. Mittal et al. [95] propose M3ER, which uses a Modality Check Step to replace an unavailable modality with a proxy feature and fuses the multimodal features with a multiplicative fusion module. M3ER is a promising technique but similarly lacks experiments in unimodal and bimodal settings.…”
Section: Unified Model (mentioning)
confidence: 99%
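The Modality Check Step quoted above can also be illustrated briefly. Below is a hedged sketch, assuming a CCA-based reliability score and a precomputed linear map as the proxy generator; the function names, threshold, and proxy map are illustrative assumptions, not M3ER's actual implementation.

import numpy as np
from sklearn.cross_decomposition import CCA

def modality_check(x, y, n_components=4, threshold=0.3):
    """Return True if features x are sufficiently correlated with the
    reference features y (mean canonical correlation over components)."""
    x_c, y_c = CCA(n_components=n_components).fit_transform(x, y)
    corrs = [np.corrcoef(x_c[:, k], y_c[:, k])[0, 1]
             for k in range(n_components)]
    return float(np.mean(corrs)) >= threshold

def proxy_feature(y, w):
    """Hypothetical proxy: map the reliable modality's features y into the
    missing modality's feature space with a precomputed linear map w."""
    return y @ w

# Usage sketch: if the speech features fail the check (e.g. a corrupted
# sensor), replace them with a proxy regressed from the text features.
rng = np.random.default_rng(0)
speech = rng.normal(size=(64, 16))   # possibly corrupted modality
text = rng.normal(size=(64, 32))     # reliable modality
w = rng.normal(size=(32, 16)) * 0.1  # stand-in for a learned map
if not modality_check(speech, text):
    speech = proxy_feature(text, w)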
“…However, in general, most multimodal fusion techniques require, in the testing phase, the simultaneous presence of all the modalities that were used during model training [1]. This requirement becomes a severe limitation when one or more sensors are missing or their signals are severely corrupted by noise during testing, unless such situations are explicitly handled by the modelling framework [8]. Thus, it would be desirable to improve the testing performance of individual modalities by using other modalities during training [3][9][10].…”
Section: Introduction (mentioning)
confidence: 99%