2023
DOI: 10.1109/access.2023.3321122

Mel-MViTv2: Enhanced Speech Emotion Recognition With Mel Spectrogram and Improved Multiscale Vision Transformers

Kah Liang Ong,
Chin Poo Lee,
Heng Siong Lim
et al.

Abstract: Speech emotion recognition aims to automatically identify and classify emotions from speech signals. It plays a crucial role in various applications such as human-computer interaction, affective computing, and social robotics. Over the years, researchers have proposed different approaches for speech emotion recognition, leveraging various classifiers and features. However, despite the advancements, existing methods in speech emotion recognition still have certain limitations. Some approaches rely on handcrafte…

Cited by 9 publications (3 citation statements)
References 18 publications
“…These features serve as the The comprehensive benchmark studies in Table 9, with their respective subset of features, datasets, and accuracy scores, show a snapshot of the broader research landscape in SER. Multiple features based speech emotion recognition systems are proposed considering distinct machine learning models such as voting classifier [19], [61], attention-based multi-learning model (ABMD) [23], 1D-CNN [26] and MViTv2 [60]. However, these multi-featured emotion recognition systems target a particular region accent.…”
Section: A Discussion (mentioning)
confidence: 99%
“…This technique visualizes the changes in frequency over time, aiding in understanding the complex structures within audio signals. Designed to mimic the characteristics of human hearing, the Mel scale processes frequencies akin to how humans perceive sound, sharing similarities with MFCC analysis but emphasizing the visual representation of audio signals [14]-[15].…”
Section: B Mel Spectrogram Analysis (mentioning)
confidence: 99%
“…Signal intra-pulse sequences are transformed into gray-scaled STFT spectrograms, and a CNN network is designed to extract features and classify the spectrogram [25]. Kah Liang Ong proposes a speech emotion recognition method that combines the Mel spectrogram with the Short-Term Fourier Transform (Mel-STFT) and Improved Multiscale Vision Transformers (MViTv2) [26]. However, Markov transfer field images possess several advantages in comparison with time-frequency maps and GAF transformations.…”
Section: Introduction (mentioning)
confidence: 99%