Interspeech 2005
DOI: 10.21437/interspeech.2005-379
Speaker independent emotion recognition by early fusion of acoustic and linguistic features within ensembles

Abstract: Herein we present a comparison of novel concepts for a robust fusion of prosodic and verbal cues in speech emotion recognition. A total of 276 acoustic features are extracted from each spoken phrase. For linguistic content analysis we use the Bag-of-Words text representation, which allows acoustic and linguistic features to be integrated within one vector prior to a final classification. Extensive feature selection by filter- and wrapper-based methods is performed. Likewise optimal sets via SVM-SFFS and single fea…
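The early-fusion scheme described in the abstract, per-phrase acoustic features and a Bag-of-Words representation of the transcript concatenated into a single vector for one classifier, can be illustrated with a minimal sketch. The toy transcripts, the random acoustic values, and the linear SVM below are illustrative assumptions, not the paper's 276-feature set, corpus, or classifier configuration.

# Minimal early-fusion sketch (assumed setup, not the paper's exact pipeline):
# concatenate per-phrase acoustic features with a Bag-of-Words vector of the
# transcript, then train a single classifier on the fused vector.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy data: four phrases with 276 acoustic features each (random placeholders).
acoustic = np.random.rand(4, 276)
transcripts = [
    "i am so happy today",
    "leave me alone",
    "this is wonderful news",
    "i cannot believe you did that",
]
labels = ["joy", "anger", "joy", "anger"]

# Linguistic stream: Bag-of-Words term counts for each transcript.
bow = CountVectorizer().fit_transform(transcripts).toarray()

# Early fusion: one vector per phrase, acoustic and linguistic parts side by side.
fused = np.hstack([acoustic, bow])

# Single classifier over the fused representation (linear SVM as an example).
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
clf.fit(fused, labels)
print(clf.predict(fused))

In a late-fusion setup, by contrast, separate classifiers would be trained per stream and their decisions combined afterwards; early fusion keeps a single decision stage over the joint vector.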

Cited by 90 publications (20 citation statements)
References 10 publications
“…Human emotions are expressed in, and can accordingly be identified from, different modalities, such as speech, gestures, and facial expressions [4,5,6]. Consequently, the research community has long sought efficient ways to utilise multimodal information in an attempt to improve recognition performance and arrive at a more holistic understanding of human behaviour and communication [7,8]. A plethora of such works have investigated different ways to improve AER by combining several information streams: e.g. audio, video, text, gestures, and physiological signals.…”
Section: Introduction
confidence: 99%
“…In general, it is also commonly believed that the emotional state in a speech has an impact on the speech production mechanism across the glottal source and vocal tract of the individuals [31]. The studies prompt us to investigate speech emotions from a speaker-independent [27] perspective. Studies have also shown possible ways of speaker-independent emotion representation for both seen and unseen speakers over a large multi-speakers emotional corpus [23], emotion feature extraction and classifiers [32].…”
Section: Speaker-independent Perspective on Emotion
confidence: 99%
“…To validate the idea of speaker-independent emotion elements across speakers [23,27], we conduct a preliminary study using a CycleGAN-based emotional voice conversion framework [22], which is designed for speaker-dependent EVC. In this study, we train a network with two conversion pipelines for the mapping of spectrum and prosody (CWT-based F0 features), respectively.…”
Section: Speaker-independent Perspective on Emotion
confidence: 99%
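The prosody representation mentioned in this excerpt, CWT-based F0 features, can be sketched briefly: the F0 contour is normalised and decomposed with a continuous wavelet transform into coefficient tracks at several temporal scales, which a conversion network would then map. The scale choice, the normalisation, and the PyWavelets Mexican-hat wavelet below are illustrative assumptions, not the cited framework's exact configuration.

# Sketch: continuous wavelet transform (CWT) of an F0 contour as a multi-scale
# prosody representation (assumed settings; not the cited framework's exact ones).
import numpy as np
import pywt

def cwt_f0_features(f0, num_scales=10):
    # Normalise the contour so the representation is less speaker-specific.
    f0 = np.asarray(f0, dtype=float)
    f0 = (f0 - f0.mean()) / (f0.std() + 1e-8)
    scales = np.arange(1, num_scales + 1)             # linear scales (assumption)
    coeffs, _ = pywt.cwt(f0, scales, wavelet="mexh")  # shape: (num_scales, T)
    return coeffs

# Example: a synthetic 200-frame F0 contour.
f0_contour = 120.0 + 20.0 * np.sin(np.linspace(0.0, 6.0, 200))
print(cwt_f0_features(f0_contour).shape)  # (10, 200): one coefficient track per scale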
“…Many results are computed with randomly-split training, validation and test sets, without separating speakers, as in [21]. Many rely on different preprocessing [22,23], on different architectures [22], or use multi-modal features rather than only audio [23]. Rather than aiming at state-of-the-art classification accuracy for these datasets, we focus on evaluating the performance of MTS layers compared to standard convolution with the same number of channels, i.e.…”
Section: IEMOCAP, the Interactive Emotional Dyadic Motion Capture
confidence: 99%
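The evaluation issue raised in this excerpt, random splits that let the same speaker appear in both training and test data, can be contrasted with a speaker-independent split in a short sketch. The synthetic features, the number of speakers, and the use of scikit-learn's GroupShuffleSplit are assumptions for illustration, not the cited works' protocols.

# Sketch: random split vs. speaker-independent split (synthetic data).
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, train_test_split

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 16))        # utterance-level features (placeholder)
y = rng.integers(0, 4, size=100)          # four emotion classes (placeholder)
speakers = rng.integers(0, 10, size=100)  # ten speaker IDs (placeholder)

# Random split: utterances of one speaker can land in both train and test,
# which tends to inflate reported accuracy.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Speaker-independent split: all utterances of a speaker stay on one side.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=speakers))
assert set(speakers[train_idx]).isdisjoint(speakers[test_idx])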