“…Human emotions are expressed in, and can accordingly be identified from, different modalities, such as speech, gestures, and facial expressions [4,5,6]. Consequently, the research community has long sought efficient ways to utilise multimodal information in an attempt to improve recognition performance and arrive at a more holistic understanding of human behaviour and communication [7,8]. A plethora of such works have investigated different ways to improve AER by combining several information streams, e.g., audio, video, text, gestures, and physiological signals.…”