On Enhancing Speech Emotion Recognition Using Generative Adversarial Networks

Sahu, Saurabh; Gupta, Rahul; Espy‐Wilson, Carol Y.

doi:10.21437/interspeech.2018-1883

Cited by 46 publications

(45 citation statements)

References 18 publications


“…Automatic speech emotion recognition has begun to take advantage of these methods. Sahu et al [13] explored how GANs could be used to augment utterance-level features, while Chang et al [14] used DCGANs to improve performance on spectrograms.…”

Section: Adversarial Methods (mentioning)

confidence: 99%

“…More recently, speech research has followed the popularization of adversarial methods, including Generative Adversarial Networks (GANs) [11], [12], [13], [14], Wasserstein GANs (WGANs) [15], and CycleGANs [16], [17], [18]. However, many of these generative speech transfer models introduce noise and have a long way to go, as explored by Kaneko et al [18].…”

Section: Introduction (mentioning)

confidence: 99%


Improving Cross-Corpus Speech Emotion Recognition with Adversarial Discriminative Domain Generalization (ADDoG)

Gideon

McInnis

Provost

2021

IEEE Trans. Affective Comput.


Automatic speech emotion recognition provides computers with critical context to enable user understanding. While methods trained and tested within the same dataset have been shown successful, they often fail when applied to unseen datasets. To address this, recent work has focused on adversarial methods to find more generalized representations of emotional speech. However, many of these methods have issues converging, and only involve datasets collected in laboratory conditions. In this paper, we introduce Adversarial Discriminative Domain Generalization (ADDoG), which follows an easier to train "meet in the middle" approach. The model iteratively moves representations learned for each dataset closer to one another, improving cross-dataset generalization. We also introduce Multiclass ADDoG, or MADDoG, which is able to extend the proposed method to more than two datasets, simultaneously. Our results show consistent convergence for the introduced methods, with significantly improved results when not using labels from the target dataset. We also show how, in most cases, ADDoG and MADDoG can be used to improve upon baseline state-of-the-art methods when target dataset labels are added and in-the-wild data are considered. Even though our experiments focus on cross-corpus speech emotion, these methods could be used to remove unwanted factors of variation in other settings.


“…In a different way, we apply GANs based adversarial training to generate robust representations across domains (speaker to be specific) for speech emotion recognition. Among previous work on SER, GANs are mainly utilized to learn discriminative representations [13] and conduct data augmentation [14]. Our method is different in that we aim to disentangle speaker information and learn speaker-invariant representations for SER.…”

Section: Related Work (mentioning)

confidence: 99%

Speaker-Invariant Affective Representation Learning via Adversarial Training

Tao

Huang

et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


Representation learning for speech emotion recognition is challenging due to labeled data sparsity issue and lack of gold-standard references. In addition, there is much variability from input speech signals, human subjective perception of the signals and emotion label ambiguity. In this paper, we propose a machine learning framework to obtain speech emotion representations by limiting the effect of speaker variability in the speech signals. Specifically we propose to disentangle the speaker characteristics from emotion through an adversarial training network in order to better represent emotion. Our method combines the gradient reversal technique with an entropy loss function to remove such speaker information. Our approach is evaluated on both IEMOCAP and CMU-MOSEI datasets. We show that our method improves speech emotion classification and increases generalization to unseen speakers.
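The two ingredients named in this abstract, gradient reversal and an entropy term, can be illustrated with a minimal sketch. The abstract does not give implementation details, so everything below is an assumption made for illustration: the class name `GradientReversal`, the scaling factor `lam`, and the reading that the entropy is computed over a speaker classifier's output distribution are hypothetical and are not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class GradientReversal:
    """Generic gradient reversal layer (hypothetical helper, not the paper's code).

    Forward pass: identity. Backward pass: gradient multiplied by -lam, so the
    layers below are pushed to make the classifier above perform worse.
    """

    def __init__(self, lam=1.0):
        self.lam = lam  # assumed trade-off weight

    def forward(self, features):
        return list(features)

    def backward(self, grad_output):
        return [-self.lam * g for g in grad_output]


# Forward leaves features unchanged; backward flips and scales the gradient.
grl = GradientReversal(lam=0.5)
features = [0.2, -1.0, 3.0]
passed = grl.forward(features)
reversed_grad = grl.backward([1.0, -2.0, 4.0])

# A speaker classifier that cannot tell speakers apart outputs a near-uniform
# distribution, which has higher entropy than a confident prediction.
uniform_entropy = entropy(softmax([0.0, 0.0, 0.0, 0.0]))
confident_entropy = entropy(softmax([10.0, 0.0, 0.0, 0.0]))
```

Under this reading, a high entropy over the speaker posterior indicates that little speaker information remains in the representation, while the reversed gradient is what drives the encoder in that direction during training.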


“…In [4,5,6,7], CNN and LSTM based models are explored from feature representations such as MFCC and OpenSMILE [8] features. In [9,10,11,12], adversarial learning paradigm is explored for robust recognition. In [13,14], transfer learning approach is explored.…”

Section: Introduction (mentioning)

confidence: 99%

X-Vectors Meet Emotions: A Study On Dependencies Between Emotion and Speaker Recognition

Pappagari

Wang

Villalba

et al. 2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)


In this work, we explore the dependencies between speaker recognition and emotion recognition. We first show that knowledge learned for speaker recognition can be reused for emotion recognition through transfer learning. Then, we show the effect of emotion on speaker recognition. For emotion recognition, we show that using a simple linear model is enough to obtain good performance on the features extracted from pre-trained models such as the x-vector model. Then, we improve emotion recognition performance by finetuning for emotion classification. We evaluated our experiments on three different types of datasets: IEMOCAP, MSP-Podcast, and Crema-D. By fine-tuning, we obtained 30.40%, 7.99%, and 8.61% absolute improvement on IEMOCAP, MSP-Podcast, and Crema-D respectively over baseline model with no pre-training. Finally, we present results on the effect of emotion on speaker verification. We observed that speaker verification performance is prone to changes in test speaker emotions. We found that trials with angry utterances performed worst in all three datasets. We hope our analysis will initiate a new line of research in the speaker recognition community.

