Speech emotion recognition (SER) is a challenging research field that has attracted considerable attention over the last two decades. The success of deep convolutional neural networks (DCNNs) in a variety of difficult pattern recognition problems has motivated researchers to build SER systems on deep learning algorithms. The most essential requirement for training a deep model is a large-scale dataset; in many cases, however, such an amount of data is not available, and transfer learning offers a practical solution to this problem. In this paper, we propose an SER system based on AlexNet, the well-known deep model pre-trained on the large-scale ImageNet dataset. To provide a suitable input for such a model, a novel enriched spectrogram is developed by fusing wide-band and narrow-band spectrograms; the fused spectrogram benefits from both high temporal and high spectral resolution. These images are fed to the pre-trained AlexNet. All experiments were performed on the popular Emo-DB, IEMOCAP, and eNTERFACE05 datasets using 10-fold cross-validation and Leave-One-Speaker-Group-Out evaluation, i.e., the speaker-dependent and speaker-independent protocols, respectively. The proposed approach achieves competitive performance compared with other state-of-the-art methods.
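A minimal sketch of the fused-spectrogram idea described above, assuming librosa for audio loading and torchvision's pre-trained AlexNet. The window lengths, hop size, and the fusion rule (simple averaging of dB spectrograms) are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np
import librosa
import torch
import torchvision.models as models

def fused_spectrogram(y, sr, wide_win=0.005, narrow_win=0.030, hop=0.010):
    """Fuse a wide-band spectrogram (short window, fine time resolution)
    with a narrow-band one (long window, fine frequency resolution)."""
    hop_length = int(hop * sr)
    specs = []
    for win in (wide_win, narrow_win):
        S = np.abs(librosa.stft(y, n_fft=1024, hop_length=hop_length,
                                win_length=int(win * sr)))
        specs.append(librosa.amplitude_to_db(S, ref=np.max))
    return (specs[0] + specs[1]) / 2.0  # assumed fusion: average in dB

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file
fused = fused_spectrogram(y, sr)

# Scale to [0, 1], replicate to 3 channels, resize to AlexNet's 224x224 input.
img = (fused - fused.min()) / (fused.max() - fused.min() + 1e-8)
x = torch.tensor(img, dtype=torch.float32).unsqueeze(0).repeat(3, 1, 1)
x = torch.nn.functional.interpolate(x.unsqueeze(0), size=(224, 224),
                                    mode="bilinear", align_corners=False)

alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
# Replace the final classifier layer for, e.g., the 7 Emo-DB emotion classes.
alexnet.classifier[6] = torch.nn.Linear(4096, 7)
logits = alexnet(x)
```

In practice the pre-trained network would also expect ImageNet mean/std normalization, and the replaced classifier layer would be fine-tuned on the emotional speech corpus.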
Recognition of emotion from speech is a significant subject in human-machine interaction. In this study, the speech signal is analyzed to build a system able to recognize human emotion, and a new set of features in the time, frequency, and time-frequency domains is proposed to increase accuracy. After extracting pitch, MFCC, wavelet, zero-crossing rate (ZCR), and energy features, neural networks classify four emotions from the EMO-DB and SAVEE databases. With the combined feature set, accuracy on EMO-DB is 100% for two emotions, 98.48% for three emotions, and 90% for four emotions; these results exceed those on SAVEE, owing to EMO-DB's greater variety of speech, larger number of spoken words, and inclusion of both male and female speakers. On SAVEE, accuracy is 97.83% for two emotions (happy and sad), 84.75% for three emotions (angry, normal, and sad), and 77.78% for four emotions (happy, angry, sad, and normal).
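A minimal sketch of this hand-crafted feature pipeline, assuming librosa for MFCC, ZCR, energy, and pitch extraction, PyWavelets for the wavelet coefficients, and a scikit-learn MLP as the neural network. The per-utterance pooling statistics, pitch range, wavelet family, and network size are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np
import librosa
import pywt
from sklearn.neural_network import MLPClassifier

def extract_features(y, sr):
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # spectral envelope
    zcr = librosa.feature.zero_crossing_rate(y)           # noisiness
    energy = librosa.feature.rms(y=y)                     # short-time energy
    f0, _, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)  # pitch contour
    f0 = f0[~np.isnan(f0)] if np.any(~np.isnan(f0)) else np.zeros(1)
    cA, cD = pywt.dwt(y, "db4")                           # 1-level wavelet
    # Pool each feature stream into fixed-length per-utterance statistics.
    feats = [mfcc.mean(axis=1), mfcc.std(axis=1),
             [zcr.mean(), zcr.std(), energy.mean(), energy.std(),
              f0.mean(), f0.std(),
              np.abs(cA).mean(), np.abs(cD).mean()]]
    return np.concatenate([np.ravel(f) for f in feats])

# X: stacked feature vectors over the corpus; y_labels: emotion classes.
# clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500).fit(X, y_labels)
```

The fixed-length statistics make utterances of different durations comparable, which is what lets a simple feed-forward classifier operate on whole-utterance features.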