Emotion Recognition from Speech Using Wav2vec 2.0 Embeddings

2021 · Preprint · DOI: 10.48550/arxiv.2104.03502

Abstract: Emotion recognition datasets are relatively small, making the use of the more sophisticated deep learning approaches challenging. In this work, we propose a transfer learning method for speech emotion recognition where features extracted from pre-trained wav2vec 2.0 models are modeled using simple neural networks. We propose to combine the output of several layers from the pre-trained model using trainable weights which are learned jointly with the downstream model. Further, we compare performance using two di…
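The trainable layer combination described in the abstract amounts to a weighted sum over the outputs of all transformer layers, with the weights learned jointly with the downstream classifier. A minimal PyTorch sketch, assuming hidden states of shape (batch, time, dim) and using softmax normalization as one plausible choice; the module name LayerWeightedSum is invented here for illustration:

```python
import torch
import torch.nn as nn

class LayerWeightedSum(nn.Module):
    """Softmax-weighted average of the hidden states from all layers."""
    def __init__(self, num_layers: int):
        super().__init__()
        # One trainable scalar per layer, learned jointly with the classifier.
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, hidden_states):
        # hidden_states: sequence of num_layers tensors, each (batch, time, dim)
        stacked = torch.stack(tuple(hidden_states), dim=0)  # (L, B, T, D)
        w = torch.softmax(self.layer_logits, dim=0)         # normalized weights
        return (w.view(-1, 1, 1, 1) * stacked).sum(dim=0)   # (B, T, D)
```

Normalizing with a softmax keeps the mixture convex, so the relative importance each layer ends up with is directly readable from the learned weights; whether the paper normalizes exactly this way is an assumption here.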

Cited by 17 publications (23 citation statements). References 32 publications.
Citation types: 0 supporting, 23 mentioning, 0 contrasting.

Citation statements:
“…The multiplet loss improved the accuracy by around 3% on the two datasets. Accuracy comparison on RAVDESS: [19] (2020) 71.61%; Muppidi and Radfar [16] (2021) 77.87%; Mustaqeem and Kwon [17] (2020) 79.50%; Mustaqeem and Kwon [18] (2020) 80.00%; Seo and Kim [23] (2020) 83.33%; Pepino et al. [21] (2021) …”
Section: Results and Comparisons (mentioning, confidence: 99%)
“…Finally, a fully connected layer followed by a softmax layer is used to identify emotions. Pepino et al. [21] used pre-trained wav2vec, a framework for extracting representations from raw audio data. The extracted features, the eGeMAPS descriptors, and the spectrograms are used as inputs for a shallow neural network.…”
Section: State of the Art (mentioning, confidence: 99%)
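As a rough sketch of the pipeline this excerpt describes, frozen wav2vec 2.0 features pooled over time and fed to a shallow classifier, using the Hugging Face transformers API; the checkpoint name, the mean pooling, and the head sizes are assumptions for illustration, not the cited paper's exact setup:

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

ckpt = "facebook/wav2vec2-base"  # assumed checkpoint for illustration
extractor = Wav2Vec2FeatureExtractor.from_pretrained(ckpt)
encoder = Wav2Vec2Model.from_pretrained(ckpt).eval()  # frozen feature extractor

wav = torch.zeros(16000)  # placeholder: 1 s of 16 kHz mono audio
inputs = extractor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    frames = encoder(**inputs).last_hidden_state  # (1, T, 768) frame features
utterance = frames.mean(dim=1)                    # mean-pool over time: (1, 768)

# Shallow downstream classifier (layer sizes are illustrative).
head = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, 4))
logits = head(utterance)                          # scores for 4 emotion classes
```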
“…Features can then be devised from the internal representations of the model. W2V has shown promising performance in SER tasks [80, 81]. For our purposes, we choose the same fine-tuned XLSR-Wav2Vec2 model [55] as for the transcriptions.…”
Section: Audio (mentioning, confidence: 99%)
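For the XLSR variant mentioned above, a hedged sketch that loads the base multilingual XLSR-53 checkpoint and exposes every layer's output; the excerpt's actual fine-tuned checkpoint is only cited via a footnote, so the model name below is a stand-in:

```python
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# XLSR-53 is the multilingual wav2vec 2.0 checkpoint; the fine-tuned variant
# used in the excerpt is not named here, so this base model is a stand-in.
ckpt = "facebook/wav2vec2-large-xlsr-53"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(ckpt)
model = Wav2Vec2Model.from_pretrained(ckpt).eval()

wav = torch.zeros(16000)  # placeholder: 1 s of 16 kHz mono audio
inputs = extractor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    # hidden_states: tuple of (num_layers + 1) tensors, each (1, T, 1024)
    states = model(**inputs, output_hidden_states=True).hidden_states
# These per-layer states are what a trainable combination (for example the
# LayerWeightedSum sketch above) would mix into a single feature sequence.
```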
“…For the acoustic modality, wav2vec2.0 embeddings (without finetuning) perform best for the GMFN and Bert-MAG models. According to the literature (Chen and Rudnicky, 2021; Pepino et al., 2021), finetuning wav2vec2.0 can further improve model performance, which might provide more effective acoustic features for future MSA research. For the visual modality, the combination of facial landmarks and action units achieves the overall best result, revealing the effectiveness of both landmarks and action units for sentiment classification.…”
Section: Model Training Module (mentioning, confidence: 99%)
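For the fine-tuning route this excerpt points to, transformers provides a classification head on top of wav2vec 2.0; a minimal end-to-end sketch, with the checkpoint and the number of emotion classes assumed for illustration:

```python
import torch
from transformers import Wav2Vec2ForSequenceClassification

# The classification head is newly initialized on top of the pre-trained
# encoder; both are updated during fine-tuning.
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=4  # assumed 4 emotion classes
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

input_values = torch.zeros(1, 16000)  # placeholder batch of raw 16 kHz audio
labels = torch.tensor([2])            # placeholder target class

optimizer.zero_grad()
out = model(input_values=input_values, labels=labels)
out.loss.backward()                   # gradients flow through the encoder too
optimizer.step()
```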