2022
DOI: 10.1109/taffc.2022.3187336

Unsupervised Personalization of an Emotion Recognition System: The Unique Properties of the Externalization of Valence in Speech

Abstract: The prediction of valence from speech is an important, but challenging problem. The expression of valence in speech has speaker-dependent cues, which contribute to performances that are often significantly lower than the prediction of other emotional attributes such as arousal and dominance. A practical approach to improve valence prediction from speech is to adapt the models to the target speakers in the test set. Adapting a speech emotion recognition (SER) system to a particular speaker is a hard problem, es…

Cited by 14 publications (13 citation statements)
References 43 publications
“…In contrast to prior work, we explore personalization with fine-tuned encoders instead of pre-extracted features, which achieves superior performance compared to the best-performing models. For example, our weakest baseline (HuBERT-large fine-tuning) achieves a two times higher Concordance Correlation Coefficient (CCC) compared to the reported results from Sridhar et al [14] for valence estimation. More importantly, our method is extensible and remains effective for unseen speakers without the need to re-train any components.…”
Section: Related Work (mentioning)
confidence: 59%
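
The citation statement above compares systems by their Concordance Correlation Coefficient (CCC). As a reminder of what that metric measures, here is a minimal sketch in plain Python using the standard definition, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). The function name `concordance_cc` and the choice of population (divide-by-n) statistics are assumptions made for illustration; they are not taken from either paper.

```python
from statistics import fmean


def concordance_cc(pred, gold):
    """Concordance Correlation Coefficient between two equal-length sequences.

    Illustrative helper (hypothetical name); uses population statistics.
    """
    if len(pred) != len(gold) or len(pred) < 2:
        raise ValueError("need two sequences of equal length >= 2")

    n = len(pred)
    mean_p, mean_g = fmean(pred), fmean(gold)

    # Population variance and covariance (divide by n).
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n

    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0:
        # Both sequences are the same constant; treated here as perfect
        # agreement by convention (an assumption of this sketch).
        return 1.0
    return 2 * cov / denom
```

Unlike Pearson correlation, CCC penalizes a shift in mean or a mismatch in scale: predictions that track the labels perfectly but sit one unit too high score below 1.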
“…However, most of the existing work are validated on datasets with limited speakers. Most relevant to our work is the unsupervised personalized method proposed by Sridhar et al [14], which is validated on the same dataset (MSP-Podcast) as in this paper. They propose to find speakers in the train set to form the adaptation set whose acoustic patterns closely resemble those of the speakers in the test set.…”
Section: Related Work (mentioning)
confidence: 99%
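
The statement above describes building an adaptation set from train-set speakers whose acoustic patterns resemble those of the test speakers. The sketch below only illustrates that general idea and is not the method of Sridhar et al.: the similarity measure (cosine similarity between per-speaker mean feature vectors), the function names, and the parameter `k` are all assumptions introduced here.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_adaptation_speakers(train, test, k=1):
    """Pick, for each test speaker, the k most similar train speakers.

    `train` and `test` map a speaker id to a list of feature vectors.
    Hypothetical helper: similarity is cosine between speaker means.
    """
    train_means = {spk: mean_vector(vecs) for spk, vecs in train.items()}
    selected = set()
    for vecs in test.values():
        target = mean_vector(vecs)
        ranked = sorted(
            train_means,
            key=lambda spk: cosine(train_means[spk], target),
            reverse=True,
        )
        selected.update(ranked[:k])
    return selected
```

No test-set labels are used at any point, which is what makes such a selection step compatible with unsupervised personalization.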
“…This reveals that capturing s_t in valence by only relying on audio is a challenging task. It is a common trend in literature that the audio modality insufficiently explains ground-truth valence m_t [13], [58], and this trend is even more challenging for modeling s_t in valence.…”
Section: Qualitative Analysis Of Estimates (mentioning)
confidence: 99%