Interspeech 2019
DOI: 10.21437/interspeech.2019-1658
Robust Speech Emotion Recognition Under Different Encoding Conditions

Abstract: In an era where large speech corpora annotated for emotion are hard to come by, and especially ones where emotion is expressed freely instead of being acted, the importance of using free online sources for collecting such data cannot be overstated. Most of those sources, however, contain encoded audio due to storage and bandwidth constraints, often at very low bitrates. In addition, with the increased industry interest in voice-based applications, it is inevitable that speech emotion recognition (SER) algorith…
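The abstract's central concern is that freely collected speech is often stored in lossy, low-bitrate encodings. As a minimal illustration of how such degradation could be emulated when probing SER robustness, the sketch below applies crude uniform re-quantization to a waveform; this is an assumption for illustration only, not the codec pipeline the paper itself evaluates (real codecs such as MP3 or Opus are far more complex):

```python
import numpy as np

def simulate_low_bitrate(signal: np.ndarray, bits: int = 4) -> np.ndarray:
    """Crude stand-in for lossy encoding: uniform re-quantization.

    Assumes `signal` lies in [-1, 1]. Real perceptual codecs do much
    more than this; the point is only to show one way degraded audio
    could be generated for robustness experiments.
    """
    levels = 2 ** bits
    # Map [-1, 1] onto `levels` integer steps and back, discarding
    # fine amplitude detail in the process.
    quantized = np.round((signal + 1.0) / 2.0 * (levels - 1))
    return quantized / (levels - 1) * 2.0 - 1.0

# Example: a 440 Hz tone loses fidelity after 4-bit re-quantization.
t = np.linspace(0, 1, 16000, endpoint=False)
tone = np.sin(2 * np.pi * 440 * t)
degraded = simulate_low_bitrate(tone, bits=4)
```

In practice one would transcode with an actual encoder (e.g. via FFmpeg) at several bitrates rather than quantize samples directly, but the experimental logic — train or test on degraded copies of the same utterances — is the same.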

Cited by 5 publications (2 citation statements)
References 17 publications
“…Although this field has seen tremendous progress in the last decades [1], three major challenges remain for real-world paralinguistics-based SER applications: a) improving on its inferior valence performance [4,8], b) overcoming issues of generalisation and robustness [12,13], and c) alleviating individual- and group-level fairness concerns, which is a prerequisite for ethical emotion recognition technology [14,15]. Previous works have attempted to tackle these issues in isolation, e.g. by using cross-modal knowledge distillation to increase valence performance [16], speech enhancement or data augmentation to improve robustness [12,13], and de-biasing techniques to mitigate unfair outcomes [17]. However, each of those approaches comes with its own knobs to twist and hyperparameters to tune, making their combination far from straightforward.…”
Section: Introduction (mentioning)
confidence: 99%
“…In particular, this lets us investigate whether the fine-tuning is necessary for adapting to acoustic mismatches between the pre-training and downstream domains, as previously shown for convolutional neural networks (CNNs) [19], or to better leverage linguistic information. This type of behavioural testing goes beyond past work that typically investigates SER models' robustness with respect to noise and small perturbations [20,21,22] or fairness [23,24], thus providing better insights into the inner workings of SER models.…”
Section: Introduction (mentioning)
confidence: 99%