In this paper, we propose a global approach to speech emotion recognition (SER) using empirical mode decomposition (EMD). Its use is motivated by the fact that EMD, combined with the Teager-Kaiser Energy Operator (TKEO), provides an efficient time-frequency analysis of non-stationary signals. In this method, each signal is decomposed by EMD into oscillating components called intrinsic mode functions (IMFs). TKEO is used to estimate the time-varying amplitude envelope and instantaneous frequency of a signal that is assumed to follow an Amplitude Modulation-Frequency Modulation (AM-FM) model. A subset of the IMFs is selected and used to extract features from the speech signal to recognize different emotions. The main contribution of our work is to extract novel features, named modulation spectral (MS) features and modulation frequency features (MFF), based on the AM-FM modulation model, and to combine them with cepstral features. It is believed that the combination of all features will improve the performance of the emotion recognition system. Furthermore, we examine the effect of feature selection on SER system performance. For the classification task, Support Vector Machine (SVM) and Recurrent Neural Network (RNN) classifiers are used to distinguish seven basic emotions. Two databases, the Berlin corpus and the Spanish corpus, are used for the experiments. On the Spanish emotional database, the RNN classifier with a combination of all features extracted from the IMFs enhances the performance of the SER system, achieving a 91.16% recognition rate. For the Berlin database, the combination of all features using the SVM classifier achieves an 86.22% recognition rate.
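For illustration, the following is a minimal sketch of the EMD-plus-TKEO front end outlined above, assuming the PyEMD and NumPy packages. The sampling rate, the number of retained IMFs, the DESA-like envelope/frequency estimates, and the summary statistics are illustrative assumptions, not the authors' exact configuration.

```python
import numpy as np
from PyEMD import EMD  # assumption: PyEMD package for empirical mode decomposition

def teager_kaiser(x):
    """Discrete Teager-Kaiser Energy Operator: psi[n] = x[n]^2 - x[n-1]*x[n+1]."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def amfm_features(signal, n_imfs=5):
    """Decompose a signal into IMFs and summarize per-IMF AM-FM descriptors."""
    imfs = EMD()(signal)[:n_imfs]  # keep only the first few oscillating modes
    feats = []
    for imf in imfs:
        # Clamp to avoid negative/zero values (TKEO can dip below zero
        # for multi-component signals), which would break sqrt/division.
        psi = np.maximum(teager_kaiser(imf), 1e-12)            # TKEO of the IMF
        psi_d = np.maximum(teager_kaiser(np.diff(imf)), 1e-12)  # TKEO of its first difference
        # Energy-separation-style (DESA-like) approximations:
        env = psi[: len(psi_d)] / np.sqrt(psi_d)    # instantaneous amplitude envelope
        freq = np.sqrt(psi_d / psi[: len(psi_d)])   # instantaneous frequency (rad/sample)
        feats.extend([env.mean(), env.std(), freq.mean(), freq.std()])
    return np.array(feats)

# Usage on a synthetic AM-FM tone (stand-in for a speech frame)
t = np.arange(0, 1, 1 / 16000.0)
x = (1 + 0.3 * np.sin(2 * np.pi * 5 * t)) * np.sin(2 * np.pi * 440 * t)
print(amfm_features(x).shape)
```

In a full system, statistics of this kind, computed per IMF, would be concatenated with the cepstral, MS, and MFF features before being passed to the SVM or RNN classifier.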
This chapter presents a comparative study of speech emotion recognition (SER) systems. A theoretical definition, the categorization of affective states, and the modalities of emotion expression are presented. To carry out this study, an SER system based on different classifiers and different feature-extraction methods is developed. Mel-frequency cepstral coefficients (MFCC) and modulation spectral (MS) features are extracted from the speech signals and used to train different classifiers. Feature selection (FS) is applied in order to find the most relevant feature subset. Several machine learning paradigms are used for the emotion classification task. A recurrent neural network (RNN) classifier is used first to classify seven emotions. Its performance is then compared to multivariate linear regression (MLR) and support vector machine (SVM) techniques, which are widely used in the field of emotion recognition for spoken audio signals. The Berlin and Spanish databases are used as the experimental data sets. This study shows that for the Berlin database, all classifiers achieve an accuracy of 83% when speaker normalization (SN) and feature selection are applied to the features. For the Spanish database, the best accuracy (94%) is achieved by the RNN classifier without SN and with FS.
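As a concrete sketch of the pipeline this chapter describes (MFCC extraction, feature selection, classification), the snippet below uses librosa and scikit-learn. The per-utterance statistics, the number of selected features, the SVM kernel, and the synthetic placeholder data are illustrative assumptions rather than the chapter's exact settings.

```python
import numpy as np
import librosa
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def utterance_features(path, n_mfcc=13):
    """Summarize one utterance as the mean/std of its MFCC trajectories."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# Placeholder data so the example runs standalone; in practice X/y would be
# built from an emotional corpus (e.g., Berlin EMO-DB) via utterance_features.
rng = np.random.default_rng(0)
X = rng.normal(size=(140, 26))    # e.g., 140 utterances x 26 MFCC statistics
y = rng.integers(0, 7, size=140)  # seven emotion classes

clf = make_pipeline(
    StandardScaler(),              # per-feature normalization (cf. speaker normalization)
    SelectKBest(f_classif, k=20),  # FS: keep the most discriminative features
    SVC(kernel="rbf", C=10.0),     # SVM classifier
)
print(cross_val_score(clf, X, y, cv=5).mean())
```

Swapping the final estimator for an RNN (or MLR) while keeping the same feature and FS stages gives the comparative setup the study evaluates.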