ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9747870
Speech Emotion Recognition Using Self-Supervised Features

Cited by 76 publications (12 citation statements). References 22 publications.

“…In order to assess the robustness and efficiency of the proposed architecture from a variety of angles, we tested the model on three datasets and conducted a large number of ablation analyses, thereby verifying the influence of parameter variables on the predictions. In comparison with earlier studies, we provided a more potent state-of-the-art end-to-end model for SER, whose adaptability will encourage the future development of multi-modal speech emotion recognition, i.e., by taking advantage of other modalities such as video and text [68,69,70]. Besides, we will also consider how to use chunk-level segment features to create a self-supervised learning framework [71], such as masking some chunk segments during the feature input process and applying a contrastive loss on the model output, as shown for wav2vec 2.0 [72,73].…”
Section: Discussion
Mentioning, confidence: 99%
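
The chunk-masking plus contrastive objective this excerpt points to follows wav2vec 2.0. Below is a rough PyTorch sketch of the idea, under clearly stated assumptions: the `encoder` module, the chunk-embedding shapes, zero-vector masking, and using all other chunks of the same utterance as negatives are hypothetical simplifications; wav2vec 2.0 itself uses a learned mask embedding, quantized targets, and sampled distractors.

```python
import torch
import torch.nn.functional as F

def masked_contrastive_loss(chunk_embeddings, encoder, mask_prob=0.15, temperature=0.1):
    """Wav2vec 2.0-style sketch: mask some chunk embeddings, re-encode the sequence,
    and train each masked output to identify its original chunk among all chunks of
    the same utterance via an InfoNCE-style cross-entropy."""
    B, T, D = chunk_embeddings.shape
    targets = chunk_embeddings.detach()           # unmasked targets (simplified; no quantization)

    mask = torch.rand(B, T, device=chunk_embeddings.device) < mask_prob
    masked_input = chunk_embeddings.clone()
    masked_input[mask] = 0.0                      # zero out masked chunks (stand-in for a learned mask token)

    context = encoder(masked_input)               # hypothetical contextual encoder, (B, T, D) -> (B, T, D)

    # Cosine-similarity logits between every contextual output and every target chunk
    logits = torch.bmm(F.normalize(context, dim=-1),
                       F.normalize(targets, dim=-1).transpose(1, 2)) / temperature  # (B, T, T)
    labels = torch.arange(T, device=logits.device).unsqueeze(0).expand(B, T)

    # Only masked positions contribute to the contrastive loss
    return F.cross_entropy(logits[mask], labels[mask])
```
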
“…Equation (4) shows how the CCC is computed given prediction ĉ and ground truth c, where s_cĉ, s²_c, s²_ĉ, c̄, and ĉ̄ represent the covariance between the ground truth and prediction, the variance of the ground truth, the variance of the prediction, the mean of the ground truth, and the mean of the prediction, respectively. Table 1 shows our experimental results on the IEMOCAP dataset, where all results are computed using standard 5-fold cross validation [17]. We observe that multi-task training improves CCC by 0.02 over the continuous baseline; however, it does not outperform the discrete baseline.…”
Section: Evaluation Metrics
Mentioning, confidence: 94%
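
The CCC referenced in this excerpt is the concordance correlation coefficient. As a reading aid, here is a minimal NumPy sketch of that computation, assuming Lin's standard definition with population (biased) variances and covariance; the exact Equation (4) of the citing paper may differ in detail.

```python
import numpy as np

def concordance_correlation_coefficient(c_true, c_pred):
    """Lin's CCC: 2*cov(c, c_hat) / (var(c) + var(c_hat) + (mean(c) - mean(c_hat))**2)."""
    c_true = np.asarray(c_true, dtype=float)
    c_pred = np.asarray(c_pred, dtype=float)
    mean_t, mean_p = c_true.mean(), c_pred.mean()
    var_t, var_p = c_true.var(), c_pred.var()                  # population variances
    cov_tp = ((c_true - mean_t) * (c_pred - mean_p)).mean()    # population covariance
    return 2.0 * cov_tp / (var_t + var_p + (mean_t - mean_p) ** 2)

# Identical sequences agree perfectly, so CCC = 1.0
print(concordance_correlation_coefficient([0.1, 0.5, 0.9], [0.1, 0.5, 0.9]))
```
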
“…For more intricate scenarios, the latter approach appears more effective [10]. Addressing complex contexts necessitates comprehensive datasets with rich contextual information and substantial samples. Combining these with gesture and text signals could further enhance SER's versatility.…”
Section: Dataset Variability
Mentioning, confidence: 99%