Interspeech 2021
DOI: 10.21437/interspeech.2021-703
Emotion Recognition from Speech Using wav2vec 2.0 Embeddings

Cited by 202 publications (83 citation statements). References 0 publications.
“…Dissanayake et al [80] used the last two participants in the validation and test sets, respectively, reaching an accuracy of 56.71% on the speech modality. With a variation of this setup, Pepino et al [40] used only the last two participants as the test set and combined the 'Calm' and 'Neutral' emotions, reducing the problem from eight emotions to seven classes. Under these conditions, the top accuracy reached by their model was 77.5%, applying a global normalization.…”
Section: Comparative Results With Related Approaches
confidence: 99%
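The split described in this excerpt can be reproduced directly from RAVDESS filenames, which encode the emotion and actor in fixed hyphen-separated fields. Below is a minimal sketch, assuming the standard RAVDESS naming scheme (emotion code in the third field, actor ID 1-24 in the seventh) and a hypothetical data_dir; it illustrates the split and class merge, not the authors' own code.

# Sketch: speaker-based RAVDESS split with 'Calm' merged into 'Neutral'.
# Assumes the standard RAVDESS filename layout, e.g. 03-01-06-01-02-01-12.wav,
# where the third field is the emotion code and the seventh is the actor ID.
from pathlib import Path

EMOTIONS = {"01": "neutral", "02": "calm", "03": "happy", "04": "sad",
            "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised"}

def load_split(data_dir, test_actors=(23, 24)):
    train, test = [], []
    for wav in Path(data_dir).rglob("*.wav"):
        fields = wav.stem.split("-")
        label = EMOTIONS[fields[2]]
        if label == "calm":          # merge 'Calm' into 'Neutral': 8 -> 7 classes
            label = "neutral"
        actor = int(fields[6])
        (test if actor in test_actors else train).append((wav, label))
    return train, test

Holding out the last actors (rather than random utterances) keeps the evaluation speaker-independent, which is why the reported accuracies across these works are only comparable when the same actors are held out.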
“…Prosody, spectral, and voice quality-based features were used to train a hierarchical DNN classifier, achieving an accuracy of 81.2% on the RAVDESS dataset. Pepino et al [40] combined hand-crafted features with deep models, using eGeMAPS features together with the embeddings extracted from Wav2Vec to train a CNN model. They achieved an accuracy of 77.5% when applying a global normalization on this dataset.…”
Section: Speech Emotion Recognition
confidence: 99%
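The feature combination these statements describe (eGeMAPS functionals alongside wav2vec 2.0 embeddings) can be sketched with the transformers and opensmile Python packages. The model checkpoint, the use of the last hidden layer, and mean pooling over time are illustrative assumptions here, not the exact configuration of the cited paper.

# Sketch: combine wav2vec 2.0 embeddings with eGeMAPS functionals for one file.
# Checkpoint choice, layer choice, and mean pooling are illustrative assumptions.
import numpy as np
import torch
import librosa
import opensmile
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,    # 88 eGeMAPS functionals
    feature_level=opensmile.FeatureLevel.Functionals,
)

def extract_features(wav_path):
    audio, sr = librosa.load(wav_path, sr=16000)    # wav2vec 2.0 expects 16 kHz
    inputs = extractor(audio, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape (1, frames, 768)
    w2v = hidden.mean(dim=1).squeeze(0).numpy()     # mean-pool over time
    egemaps = smile.process_signal(audio, sr).to_numpy().ravel()
    return np.concatenate([w2v, egemaps])           # 768 + 88 = 856 dims

The concatenated vector would then feed a downstream classifier such as the CNN mentioned in the excerpts.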
“…Other works such as [76] used the last two participants in the validation and test sets, respectively, reaching an accuracy of 56.71% on the speech modality. With a variation of this set-up, we also found the work of Pepino et al [34], which used only the last two participants as the test set and combined the 'Calm' and 'Neutral' emotions, reducing the problem from eight emotions to seven classes. Under these conditions, the top accuracy reached by their model was 77.5%, applying a global normalization.…”
Section: Comparative Results With Previous Work
confidence: 99%
“…For example, Singh et al [33] suggested the use of prosody, spectral information, and voice quality features to train a hierarchical DNN classifier, reaching an accuracy of 81.2% on RAVDESS. Pepino et al [34] combined eGeMAPS features with the embeddings extracted from an xlsr-Wav2Vec2.0 model to train a CNN model. They achieved an accuracy of 77.5% by applying a global normalization on this dataset.…”
Section: Speech Emotion Recognition
confidence: 99%
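The "global normalization" mentioned in several of these excerpts plausibly refers to standardizing each feature with statistics computed over the whole training set, as opposed to per-speaker statistics; that reading is an assumption here, and the sketch below only illustrates it.

# Sketch: global z-normalization, i.e. one mean/std per feature dimension
# computed over the entire training set and reused unchanged on the test set
# (in contrast to per-speaker normalization, where statistics come from each speaker).
import numpy as np

def fit_global_stats(train_feats):            # train_feats: (n_utts, n_dims)
    mean = train_feats.mean(axis=0)
    std = train_feats.std(axis=0) + 1e-8      # avoid division by zero
    return mean, std

def normalize(feats, mean, std):
    return (feats - mean) / std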