2021
DOI: 10.48550/arxiv.2110.04425
Preprint

Arabic Speech Emotion Recognition Employing Wav2vec2.0 and HuBERT Based on BAVED Dataset

Omar Mohamed,
Salah A. Aly

Abstract: Recently, there have been tremendous research outcomes in the fields of speech recognition and natural language processing. This is due to well-developed multilayer deep learning paradigms such as wav2vec2.0, Wav2vecU, WavBERT, and HuBERT, which provide better representation learning and capture more information. Such paradigms are pretrained on large amounts of unlabeled data and then fine-tuned on a small dataset for specific tasks. This paper introduces a deep-learning-based emotion recognition model for Arabic s…
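As a rough illustration of the pretrain-then-fine-tune recipe the abstract describes, the sketch below loads a pretrained wav2vec2.0 checkpoint with a freshly initialized classification head using HuggingFace Transformers. The checkpoint name and the three-way label space (BAVED's emotion intensity levels 0-2) are assumptions made for illustration, not details taken from the paper.

```python
# Minimal sketch, assuming a HuggingFace setup; checkpoint and label
# count are illustrative, not taken from the paper.
import torch
from transformers import (
    Wav2Vec2FeatureExtractor,
    Wav2Vec2ForSequenceClassification,
)

checkpoint = "facebook/wav2vec2-base"  # assumed public base model
extractor = Wav2Vec2FeatureExtractor.from_pretrained(checkpoint)
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    checkpoint,
    num_labels=3,  # BAVED annotates three emotion intensity levels (0-2)
)

# One second of 16 kHz audio; random noise stands in for a real utterance.
waveform = torch.randn(16000)
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 3)
predicted_level = logits.argmax(dim=-1)
```

In an actual fine-tuning run, these logits would feed a cross-entropy loss over labeled BAVED clips while the pretrained encoder weights are updated or partially frozen.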

Cited by 11 publications (12 citation statements) | References 12 publications
“…They even included a feature selection method with a Linear SVM classifier; with this comparatively lightweight model they obtained an accuracy of 96.02%, which is close to our findings. However, the proposed model was inferior to the model produced by [51] using the BAVED database. In another work [52] this technique is also applied.…”
Section: Discussion (contrasting)
confidence: 59%
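For readers unfamiliar with the pipeline this statement contrasts against, here is a minimal sketch of feature selection followed by a linear SVM in scikit-learn; the selector, the feature count, and the synthetic data are illustrative assumptions, not the cited authors' exact setup.

```python
# Hedged sketch: feature selection + Linear SVM, per the passage above.
# SelectKBest/f_classif, k, and the synthetic data are assumptions.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 88))    # e.g. 88 acoustic functionals per clip
y = rng.integers(0, 3, size=200)  # placeholder 3-class emotion labels

clf = make_pipeline(
    StandardScaler(),
    SelectKBest(f_classif, k=40),  # keep the 40 most discriminative features
    LinearSVC(),
)
clf.fit(X, y)
print(clf.score(X, y))  # training accuracy on the toy data
```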
“…Compared with the results in Table 2, the proposed W2V-BLSTM-FT classifier performed better in all the measures than the AE-BLSTM-JT using eGeMAPS features, because the pretrained wav2vec2.0 model implicitly extracted the critical features for the ASD/TD classification, in contrast to eGeMAPS, for which feature extraction is based on a deterministic approach. In other words, data manipulation in an E2E manner benefits this ASD/TD classification, as researchers have reported in other tasks [28, 29, 32, 33, 34].…”
Section: Methods (supporting)
confidence: 51%
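The eGeMAPS path this statement contrasts against is a deterministic, hand-designed feature extractor. A minimal sketch using the openSMILE Python package looks like this; the file path is a placeholder:

```python
# Hedged sketch: deterministic eGeMAPS functionals via openSMILE.
# "utterance.wav" is a placeholder path; pip install opensmile first.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("utterance.wav")  # one row of 88 functionals
print(features.shape)
```

Every clip yields the same fixed 88-dimensional vector regardless of the downstream task, which is exactly the rigidity that learned wav2vec2.0 representations avoid.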
“…The pretrained wav2vec2.0 model is a follow-up to the wav2vec and VQ-wav2vec models [30, 31], which can learn a representation of the raw waveform without labeled phonemes or graphemes. Researchers widely employ the model as a pretrained model in audio- and speech-processing tasks [32, 33, 34], as it frees them from having to select the best predefined feature set task by task. In addition, the pretrained model usually comprises numerous parameters and is trained a priori on many speech and audio datasets, without regard to a specific task.…”
Section: Proposed End-to-End ASD/TD Classification Based on Pretraine... (mentioning)
confidence: 99%
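To make this concrete, the sketch below pulls frame-level representations from a pretrained wav2vec2.0 encoder given only a raw waveform, and feeds them to a small bidirectional LSTM head in the spirit of the W2V-BLSTM classifier named above. The checkpoint, layer sizes, and binary label space are assumptions for illustration, not the cited paper's exact configuration.

```python
# Hedged sketch: pretrained wav2vec2.0 frames -> BLSTM -> classifier,
# in the spirit of W2V-BLSTM. Sizes and labels are assumptions.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class W2VBLSTM(nn.Module):
    def __init__(self, num_classes: int = 2, hidden: int = 128):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.blstm = nn.LSTM(
            input_size=self.encoder.config.hidden_size,  # 768 for the base model
            hidden_size=hidden,
            batch_first=True,
            bidirectional=True,
        )
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) of raw 16 kHz audio, no labels required
        frames = self.encoder(waveform).last_hidden_state  # (B, T, 768)
        seq, _ = self.blstm(frames)                        # (B, T, 2*hidden)
        return self.head(seq.mean(dim=1))                  # mean-pool over time

model = W2VBLSTM()
logits = model(torch.randn(1, 16000))  # one second of placeholder audio
```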
“…Recent work on speech recognition focuses on how stress, emotion, and disguise manifest in speakers' speech [6]. In this work, we aim to develop a deep learning model for voice identification in Arabic speech.…”
Section: Introduction (mentioning)
confidence: 99%