Proceedings of the 2021 International Conference on Multimodal Interaction 2021
DOI: 10.1145/3462244.3481003
Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation

Abstract: Automatic speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction. One of the main challenges in SER is data scarcity, i.e., insufficient amounts of carefully labeled data to build and fully explore complex deep learning models for emotion classification. This paper aims to address this challenge using a transfer learning strategy combined with spectrogram augmentation. Specifically, we propose a transfer learning approach that leverages a pre-trai…
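As an illustration of the spectrogram-augmentation idea mentioned in the abstract, the sketch below applies SpecAugment-style frequency and time masking to a log-mel spectrogram. The function name, mask sizes, and mask counts are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of SpecAugment-style spectrogram augmentation (frequency and
# time masking). Illustrative only; parameter values are hypothetical defaults.
import numpy as np

def augment_spectrogram(spec, max_freq_mask=8, max_time_mask=20, n_masks=2, rng=None):
    """Return a copy of `spec` (freq_bins x time_frames) with random
    frequency and time bands zeroed out."""
    rng = rng or np.random.default_rng()
    aug = spec.copy()
    n_freq, n_time = aug.shape
    for _ in range(n_masks):
        # Frequency mask: zero a random band of consecutive frequency bins.
        f = rng.integers(0, max_freq_mask + 1)
        f0 = rng.integers(0, max(1, n_freq - f))
        aug[f0:f0 + f, :] = 0.0
        # Time mask: zero a random band of consecutive time frames.
        t = rng.integers(0, max_time_mask + 1)
        t0 = rng.integers(0, max(1, n_time - t))
        aug[:, t0:t0 + t] = 0.0
    return aug

# Example: augment a log-mel spectrogram with 64 mel bins and 300 frames.
log_mel = np.random.randn(64, 300)
augmented = augment_spectrogram(log_mel)
```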

Cited by 39 publications (15 citation statements)
References: 50 publications
“…Besides investigating performance of SER models on clean test data, it is important to show that they also work well under more challenging conditions. Even though augmentation methods have been used to improve performance on clean test data [36,37], only a few studies have evaluated performance on augmented test data as well. Jaiswal and Provost [38] and Pappagari et al [39] have shown that previous SER models show robustness issues, particularly for background noise and reverb.…”
mentioning
confidence: 99%
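To make the robustness evaluation described in the statement above concrete, here is a minimal sketch of mixing background noise into test audio at a chosen signal-to-noise ratio (SNR). The function name and the example SNR are illustrative and are not taken from [38] or [39].

```python
# Hedged sketch: stress-testing an SER model by mixing noise into test audio
# at a fixed SNR. All names and values here are illustrative assumptions.
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix `noise` into `speech` so the result has the requested SNR in dB."""
    # Tile or truncate the noise to match the speech length.
    if len(noise) < len(speech):
        reps = int(np.ceil(len(speech) / len(noise)))
        noise = np.tile(noise, reps)
    noise = noise[:len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: create a 10 dB SNR test signal (placeholder waveforms, 16 kHz).
clean = np.random.randn(16000)   # 1 s of "speech"
babble = np.random.randn(48000)  # background noise recording
noisy = mix_at_snr(clean, babble, snr_db=10)
```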
“…To further distinguish between important and non-speech parts of the input, an attention mechanism, similar to [29], was used before the classifier. Apart from the attention-based approaches, transfer learning [33] and various augmentation techniques [25] have been developed to deal with the limited amount of available natural speech data. None of the existing SER models considers the users' privacy challenge, which can significantly affect their applicability in real-life applications.…”
Section: Related Work
mentioning
confidence: 99%
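The statement above refers to an attention mechanism applied before the classifier to weight informative frames. The following is a minimal PyTorch sketch of frame-level attention pooling under assumed feature and class dimensions; it is not the architecture of [29] or [33].

```python
# Minimal sketch of attention pooling over frame-level features before an
# emotion classifier. Dimensions and layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, feat_dim):
        super().__init__()
        # One scalar relevance score per frame, learned from the frame features.
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, frames):  # frames: (batch, time, feat_dim)
        weights = torch.softmax(self.score(frames), dim=1)  # (batch, time, 1)
        return (weights * frames).sum(dim=1)                # (batch, feat_dim)

# Example: pool 300 frames of 256-dim features, then classify 4 emotions.
pool = AttentionPooling(256)
classifier = nn.Linear(256, 4)
frames = torch.randn(8, 300, 256)
logits = classifier(pool(frames))
```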
“…Since the improvised corpus is closer to natural speech and can elicit more intense emotions, we use only the improvised raw audio samples from the dataset. Additionally, as most papers on SER have targeted the improvised corpus, with a focus on the detection of four core emotions (Neutral, Happy, Sad, and Angry) [28], [32], [33], we use these four emotions to be able to compare our results. Simulation Environment and Setup: To simulate a federated environment, we use the Flower framework [37] and utilize FedAvg [15] as an optimization algorithm to construct the global model from devices' local updates.…”
Section: Datasets
mentioning
confidence: 99%
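For readers unfamiliar with the FedAvg algorithm [15] referenced in the statement above, the sketch below shows its core aggregation rule: the server averages clients' locally updated parameters, weighted by the size of each client's local dataset. This is a conceptual illustration in plain NumPy, not the Flower [37] API, and all names are hypothetical.

```python
# Conceptual sketch of the FedAvg aggregation rule: a weighted average of
# client parameters by local dataset size. Not the Flower API.
import numpy as np

def fed_avg(client_params, client_sizes):
    """client_params: one list of numpy arrays per client (same shapes);
    client_sizes: number of local training examples per client."""
    total = float(sum(client_sizes))
    n_layers = len(client_params[0])
    averaged = []
    for layer in range(n_layers):
        # Weight each client's layer by its share of the total training data.
        weighted = sum(
            (size / total) * params[layer]
            for params, size in zip(client_params, client_sizes)
        )
        averaged.append(weighted)
    return averaged

# Example: two clients with different amounts of local data.
client_a = [np.ones((2, 2)), np.zeros(2)]
client_b = [np.zeros((2, 2)), np.ones(2)]
global_params = fed_avg([client_a, client_b], client_sizes=[300, 100])
```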
“…Currently, we see the massive application of deep learning (DL) in different fields, including SER. DL techniques that are used to improve SER performance may include Deep Neural Networks (DNN) of various architectures [26], Generative Adversarial Networks (GAN) [44,45], autoencoders [46,47], Extreme Learning Machines (ELM) [48], multitask learning [49], transfer learning [50], attention mechanisms [26], etc.…”
Section: Classifiers
mentioning
confidence: 99%