2021
DOI: 10.3390/app11177962

Deep Multimodal Emotion Recognition on Human Speech: A Review

Abstract: This work reviews the state of the art in multimodal speech emotion recognition methodologies, focusing on audio, text and visual information. We provide a new, descriptive categorization of methods, based on the way they handle the inter-modality and intra-modality dynamics in the temporal dimension: (i) non-temporal architectures (NTA), which do not significantly model the temporal dimension in both unimodal and multimodal interaction; (ii) pseudo-temporal architectures (PTA), which also assume an oversimpli…
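
To make the abstract's distinction between non-temporal and temporal modelling concrete, the following is a minimal PyTorch sketch (not taken from the paper; the class names, feature dimensions, and the choice of an LSTM are illustrative assumptions). The non-temporal variant collapses the time axis by mean pooling, i.e. it does not model temporal dynamics, while the temporal variant consumes the whole frame sequence.

```python
# Hedged illustration, not the paper's method: contrasts a non-temporal
# architecture (NTA-style mean pooling over frames) with a model that
# explicitly processes the temporal dimension (an LSTM over the frames).
import torch
import torch.nn as nn


class NonTemporalClassifier(nn.Module):
    """NTA-style: average frame-level features over time, then classify."""

    def __init__(self, feat_dim: int = 40, num_emotions: int = 4):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(feat_dim, 64), nn.ReLU(),
                                  nn.Linear(64, num_emotions))

    def forward(self, x):              # x: (batch, time, feat_dim)
        pooled = x.mean(dim=1)         # temporal dimension is discarded here
        return self.head(pooled)


class TemporalClassifier(nn.Module):
    """Temporal counterpart: an LSTM reads the whole frame sequence."""

    def __init__(self, feat_dim: int = 40, num_emotions: int = 4):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, 64, batch_first=True)
        self.head = nn.Linear(64, num_emotions)

    def forward(self, x):              # x: (batch, time, feat_dim)
        _, (h_n, _) = self.rnn(x)      # last hidden state summarizes the sequence
        return self.head(h_n[-1])


if __name__ == "__main__":
    frames = torch.randn(2, 100, 40)   # 2 utterances, 100 frames, 40-dim features
    print(NonTemporalClassifier()(frames).shape)   # torch.Size([2, 4])
    print(TemporalClassifier()(frames).shape)      # torch.Size([2, 4])
```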

Cited by 27 publications (18 citation statements)
References 75 publications

Citation statements
“…Another stream of research on ASRs focused on speech emotion recognition (SER) [29]. In the context of human-computer or human-human interaction applications, the challenge of identifying emotions in human speech signals is critical and extremely difficult [30]. Blockchain-based IoT devices and systems have been created [31].…”
Section: Related Work
confidence: 99%
“…Model-level fusion involves the fusion of latent representations obtained from different modality models at an intermediate level in order to leverage the advantages of both feature- and decision-level fusion. Moreover, early fusion and late fusion prevent models from learning intra- and inter-modality interaction dynamics [33], [34], respectively. Multimodal fusion involves the alignment of the features in different modalities explicitly or implicitly [35].…”
Section: Related Work
confidence: 99%
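
The fusion terminology in the statement above can be illustrated with a short, hedged sketch (PyTorch again; the encoder sizes, class name, and two-modality setup are assumptions, not the cited works' implementations). Early fusion concatenates raw features before a single model, late fusion averages per-modality decisions, and model-level fusion combines intermediate latent representations:

```python
# Minimal sketch of model-level fusion for audio + text emotion recognition.
import torch
import torch.nn as nn


class ModelLevelFusion(nn.Module):
    """Fuses intermediate (latent) representations of two modalities,
    in contrast to concatenating raw features (early/feature-level fusion)
    or averaging per-modality predictions (late/decision-level fusion)."""

    def __init__(self, audio_dim=40, text_dim=300, latent=32, num_emotions=4):
        super().__init__()
        self.audio_enc = nn.Sequential(nn.Linear(audio_dim, latent), nn.ReLU())
        self.text_enc = nn.Sequential(nn.Linear(text_dim, latent), nn.ReLU())
        self.classifier = nn.Linear(2 * latent, num_emotions)

    def forward(self, audio_feat, text_feat):
        # Each modality is encoded separately, then the latent vectors are
        # concatenated and classified jointly, so cross-modal interactions
        # can still be learned by the final layer.
        z = torch.cat([self.audio_enc(audio_feat),
                       self.text_enc(text_feat)], dim=-1)
        return self.classifier(z)


if __name__ == "__main__":
    audio = torch.randn(8, 40)    # e.g. utterance-level acoustic features
    text = torch.randn(8, 300)    # e.g. averaged word embeddings
    print(ModelLevelFusion()(audio, text).shape)   # torch.Size([8, 4])
```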
“…There is an entire class of solutions called multimodal-based affective human–computer interaction that enables computer systems to recognize specific affective states. Emotions in these approaches can be recognized in many ways, including those based on: voice parameters (timbre, raised voice, speaking rate, linguistic analysis and errors made) [7,8,9,10]; characteristics of writing [11,12,13,14,15]; changes in facial expressions in specific areas of the face [16,17,18,19,20]; gestures and posture analysis [21,22,23,24]; characterization of biological signals, including but not limited to respiration, skin conductance, blood pressure, brain imaging, and brain bioelectrical signals [25,26,27,28,29,30]; and context—assessing the fit between the emotion and the context of expression [31].…”
Section: Introduction
confidence: 99%