Speech Emotion Based Sentiment Recognition using Deep Neural Networks

Choudhary, Ravi Raj; Meena, Gaurav; Mohbey, Krishna Kumar

doi:10.1088/1742-6596/2236/1/012003

Cited by 19 publications

(5 citation statements)

References 14 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Gradient Boosting excels with 84.96% accuracy on the merged dataset. The datasets RAVDESS and TESS datasets were integrated using CNN, yielding a 97.1% accuracy in [42]. RAVDESS, TESS and SAVEE datasets were integrated using neural network yielding a testing accuracy of about 89.26% in [43].…”

Section: Resultsmentioning

confidence: 99%

Speech Emotion Recognition in Multimodal Environments with Transformer: Arabic and English Audio Datasets

Mohamed,

Koura,

Kayed

2024

IJACSA

View full text Add to dashboard Cite

Speech Emotion Recognition (SER) is a fastdeveloping area of study with a primary goal of automatically identifying and analyzing the emotional states expressed in speech. Emotions are crucial in human communication as they impact the effectiveness and meaning of linguistic expressions. SER aims to create computational approaches and models to detect and interpret emotions from speech signals. One of the primary applications of SER is evident in the field of Human-Computer Interaction (HCI), where it can be used to develop interactive systems that adapt to the user's emotional state based on their voice. This paper investigates the use of speech data for speech emotion recognition. Additionally, we applied a transformation process to convert the speech data into 2D images. Subsequently, we compared the outcomes of this transformation with the original speech data, aligning the comparison with a dataset containing labeled speech samples in both Arabic and English. Our experiments compare three methods: a transformer-based model, a Vision Transformer (ViT) based model, and a wave2vec-based model. The transformer model is trained from scratch on two significant audio datasets: the Arabic Natural Audio Dataset (ANAD) and the Toronto Emotional Speech Set (TESS), while the vision transformer is evaluated alongside wave2vec as part of transfer learning.The results are impressive. The transformer model achieved remarkable accuracies of 94% and 99% on ANAD and TESS datasets, respectively. Additionally, ViT demonstrates strong capabilities, achieving accuracies of 88% and 98% on the ANAD and TESS datasets, respectively. To assess the transfer learning potential, we also explore the Wave2Vector model with fine-tuning. However, the findings suggest limited success, achieving only a 56% accuracy rate on the ANAD dataset.

show abstract

Section: Resultsmentioning

confidence: 99%

Speech Emotion Recognition in Multimodal Environments with Transformer: Arabic and English Audio Datasets

Mohamed,

Koura,

Kayed

2024

IJACSA

View full text Add to dashboard Cite

show abstract

“…5 shows the Random Forest architecture. Double randomness is the random forest's primary attribute [17]. Random Forest is renowned for its resistance to overfitting and capacity for handling large-scale, multidimensional data.…”

Section: Classification Methodsmentioning

confidence: 99%

“…MFCC features are based on human hearing perception [14]. The first step in the MFCC feature extraction procedure is to split the recorded speech data into brief frames, which typically last 20-40 milliseconds and have a minimal amount of overlap.…”

Section: B Extraction Of Featuresmentioning

confidence: 99%

A Machine Learning Approach for Emotion Classification in Bengali Speech

Islam,

Akhi,

Akter

et al. 2023

IJACSA

View full text Add to dashboard Cite

In this research work, we have presented a machine learning strategy for Bengali speech emotion categorization with a focus on Mel-frequency cepstral coefficients (MFCC) as features. The commonly utilized method of MFCC in speech processing has proved effective in obtaining crucial phoneme-specific data. This paper analyzes the efficacy of four machine learning algorithms: Random Forest, XGBoost, CatBoost, and Gradient Boosting, and tackles the paucity of research on emotion categorization in non-English languages, particularly Bengali. With CatBoost obtaining the greatest accuracy of 82.85%, Gradient Boosting coming in second with 81.19%, XGBoost coming in third with 80.03%, and Random Forest coming in fourth with 80.01%, experimental evaluation shows encouraging outcomes. MFCC features improve classification precision and offer insightful information on the distinctive qualities of emotions expressed in Bengali speech. By demonstrating how well MFCC characteristics can identify emotions in Bengali speech, this study advances the field of emotion classification. Future research can investigate more sophisticated feature extraction methods, look into how temporal dynamics are incorporated into emotion classification models, and investigate practical uses for emotion detection systems in Bengali speech. This study advances our knowledge of emotion classification and paves the way for more effective emotion identification systems in Bengali speech by utilizing MFCC and machine learning techniques. Our work addresses the need for thorough and efficient techniques to recognize and classify emotions in speech signals in the context of emotion categorization. Understanding emotions is essential for many applications, as they are a basic component of human communication. By investigating cutting-edge strategies that show promise for enhancing the precision and effectiveness of emotion recognition, this study advances the field of emotion classification.

show abstract

“…Koduru et al [19] investigated MFCCs, LPCs, and spectrograms with SVMs, achieving 81.25% on RAVDESS. Choudhary et al [20] proposed a CNN-RNN system with MFCCs, chroma, and tonnetz, reporting 87.5% on RAVDESS. Dutt and Gader [21] used wavelet decomposition with 1D CNN-LSTM, achieving 84.6% on RAVDESS.…”

Section: Literature Reviewmentioning

confidence: 99%

A Comparative Analysis of Deep Learning Models for Multi-class Speech Emotion Detection

Elango

2024

Preprint

View full text Add to dashboard Cite

In today's digital age, where communication transcends traditional boundaries, the exploration of deep learning models for Speech Emotion Recognition (SER) holds immense significance. As we increasingly interact through digital platforms, understanding and interpreting emotions becomes crucial. Deep learning models, with their ability to autonomously learn intricate patterns and representations, offer unparalleled potential in enhancing the accuracy and efficiency of SER systems. This project delves into models for multi-class speech emotion recognition on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). The RAVDESS dataset contains 1440 speech audio recordings from 24 professional actors, expressing 8 different emotions: neutral, calm, happy, sad, angry, fearful, surprise, and disgust. Models including Deep Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Long Short-Term Memory networks (LSTMs), Gated Recurrent Units (GRUs), Temporal Convolutional Networks (TCNs), and ensembles were developed. Additionally, data augmentation through pitch shifting, noise injection, and a combination thereof expanded the dataset. Besides spectrogram inputs, handcrafted audio features like Mel Frequency Cepstral Coefficients (MFCCs), Chroma Short-time Fourier transform, root mean square, and zero crossing rate were experimented with as inputs to further boost model performance. The best-performing models were a Temporal Convolutional Network (TCN), achieving 96.88% testing accuracy, and a Gated Recurrent Unit (GRU) achieving 97.04% testing accuracy in classifying the 8 emotions, outperforming previous benchmark results on this dataset.

show abstract

Speech Emotion Based Sentiment Recognition using Deep Neural Networks

Cited by 19 publications

References 14 publications

Speech Emotion Recognition in Multimodal Environments with Transformer: Arabic and English Audio Datasets

Speech Emotion Recognition in Multimodal Environments with Transformer: Arabic and English Audio Datasets

A Machine Learning Approach for Emotion Classification in Bengali Speech

A Comparative Analysis of Deep Learning Models for Multi-class Speech Emotion Detection

Contact Info

Product

Resources

About