2019 5th International Conference on Computing, Communication, Control and Automation (ICCUBEA) 2019
DOI: 10.1109/iccubea47591.2019.9129067
Speech Emotion Recognition using MFCC features and LSTM network

Cited by 33 publications (8 citation statements) · References 5 publications
“…Most prior research uses CNN-based models for SER [37]. Among such models, the notable ones include AlexNet [38], VGG [39,40], and ResNet50 [41,42]. This section provides a short overview of the models.…”
Section: Architectures and Settings
confidence: 99%
“…MFCC is widely used to analyze speech signals and has performed well for speech-based emotion recognition systems compared to other features. In [92], MFCC feature extraction is used, and 39 coefficients are extracted. Long Short-Term Memory (LSTM) is implemented for emotion recognition.…”
Section: Review of Speech Emotion Recognition
confidence: 99%
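A 39-coefficient MFCC vector conventionally stacks 13 static coefficients with their delta and delta-delta (acceleration) terms; the cited work does not spell out its split, so the 13 + 13 + 13 layout below is an assumption. This sketch builds such per-frame vectors with a standard regression-based delta, producing a sequence an LSTM could consume:

```python
import numpy as np

def delta(feat, width=2):
    """Regression-based delta over +/- `width` frames (edge-padded)."""
    n = len(feat)
    padded = np.pad(feat, ((width, width), (0, 0)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, width + 1))
    out = np.zeros_like(feat, dtype=float)
    for k in range(1, width + 1):
        out += k * (padded[width + k : width + k + n]
                    - padded[width - k : width - k + n])
    return out / denom

def mfcc_39(static):
    """Stack static, delta, delta-delta: (frames, 13) -> (frames, 39)."""
    d1 = delta(static)
    d2 = delta(d1)
    return np.hstack([static, d1, d2])

frames = np.random.randn(100, 13)  # stand-in for real 13-D MFCC frames
X = mfcc_39(frames)
print(X.shape)  # (100, 39)
```

The resulting `(frames, 39)` matrix is the usual input shape for a frame-level LSTM over an utterance.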
“…Unlike most of the previous studies in the literature, the method of our feature fusion was inspired by the way that conventional speech features (e.g., Mel-Frequency Cepstral Coefficients (MFCCs)) are computed. That is, 32D Low-Level Descriptor (LLD) features, including 12D Chroma [22] and 20D MFCC [23], are extracted. The High-Level Statistical Functions (HSF), such as the mean of Chroma and the mean, variance, and maximum of MFCC, are calculated accordingly.…”
Section: Emotion Feature Extraction
confidence: 99%
“…Each frame was Z-normalized. From each frame, 32D Low-Level Descriptor (LLD) features, including 12D Chroma [23] and 20D MFCC [24], were extracted. The High-Level Statistical Functions (HSF), such as the mean of Chroma and the mean, variance, and maximum of MFCC, were calculated.…”
Section: Feature Extraction
confidence: 99%
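The two statements above describe the same fusion pattern: frame-level LLDs (12D Chroma + 20D MFCC) are pooled into a fixed-length utterance descriptor via HSFs. A minimal sketch, assuming the statistics are exactly those named (mean of Chroma; mean, variance, and max of MFCC) and concatenated in that order, which gives 12 + 3 × 20 = 72 dimensions; the cited papers may use a different ordering or additional functionals:

```python
import numpy as np

def hsf_fuse(chroma, mfcc):
    """Pool frame-level LLDs into one utterance-level vector:
    mean of 12-D Chroma, plus mean, variance, and max of 20-D MFCC,
    i.e. a 12 + 3*20 = 72-D descriptor (illustrative layout)."""
    return np.concatenate([
        chroma.mean(axis=0),  # 12-D: mean Chroma
        mfcc.mean(axis=0),    # 20-D: mean MFCC
        mfcc.var(axis=0),     # 20-D: variance of MFCC
        mfcc.max(axis=0),     # 20-D: max of MFCC
    ])

chroma = np.random.rand(200, 12)   # stand-in frame-level Chroma
mfcc = np.random.randn(200, 20)    # stand-in frame-level MFCC
v = hsf_fuse(chroma, mfcc)
print(v.shape)  # (72,)
```

Pooling with statistics like these is what lets variable-length utterances be fed to fixed-input classifiers.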