This paper presents a text-independent, speaker-independent emotion recognition system built on a novel classifier: a hybrid cascade of a Gaussian mixture model and a deep neural network (GMM-DNN). The hybrid classifier has been assessed for emotion recognition on the Emirati speech database (Arabic United Arab Emirates database) with six different emotions. The cascaded GMM-DNN classifier has been compared with support vector machine (SVM) and multilayer perceptron (MLP) classifiers: it attains an accuracy of 83.97%, versus 80.33% for the SVM and 69.78% for the MLP. These results demonstrate that the hybrid classifier gives significantly higher emotion recognition accuracy than the SVM and MLP classifiers, and that its results are similar to those obtained from human judges in a subjective assessment. The classifier has also been tested on two distinct emotional databases and under both normal and noisy talking conditions; the dominant signal mask provided by the hybrid classifier offers better system performance in the presence of noisy signals.

INDEX TERMS Deep neural network, emotion recognition, Gaussian mixture model.
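As a minimal sketch of the cascade described above, assuming frame-level acoustic features (e.g. MFCCs) and using scikit-learn in place of the authors' actual toolchain: one GMM is fit per emotion class, and the vector of per-class log-likelihood scores is fed to a small feed-forward network. The mixture counts and layer sizes below are illustrative assumptions, not the paper's settings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.neural_network import MLPClassifier

def train_gmm_dnn(X_train, y_train, n_classes=6, n_mix=16):
    # Stage 1: fit one GMM per emotion class on that class's frames.
    gmms = []
    for c in range(n_classes):
        gmm = GaussianMixture(n_components=n_mix, covariance_type='diag',
                              random_state=0)
        gmm.fit(X_train[y_train == c])
        gmms.append(gmm)

    # Stage 2: train a DNN on the per-class GMM log-likelihood scores.
    scores = np.column_stack([g.score_samples(X_train) for g in gmms])
    dnn = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
                        random_state=0)
    dnn.fit(scores, y_train)
    return gmms, dnn

def predict_gmm_dnn(gmms, dnn, X):
    # Score each sample under every class GMM, then let the DNN decide.
    scores = np.column_stack([g.score_samples(X) for g in gmms])
    return dnn.predict(scores)
```

The design point of the cascade is that the GMM stage compresses raw features into class-conditional likelihood evidence, which the DNN can then weigh nonlinearly rather than relying on a simple argmax over likelihoods.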
This research presents an approach to enhancing text-independent speaker identification performance in emotional talking environments, based on a novel classifier: the cascaded Gaussian mixture model-deep neural network (GMM-DNN). The work proposes, implements, and evaluates this cascaded classifier for speaker identification in emotional talking environments. The results show that the cascaded GMM-DNN classifier improves speaker identification performance across emotions on two distinct speech databases: the Emirati speech database (Arabic United Arab Emirates dataset) and the English "Speech Under Simulated and Actual Stress" (SUSAS) dataset. The proposed classifier outperforms classical classifiers such as the multilayer perceptron (MLP) and the support vector machine (SVM) on each dataset. Speaker identification performance attained with the cascaded GMM-DNN is similar to that acquired from subjective assessment by human listeners.

Keywords: deep neural network; emotional talking environments; Gaussian mixture model; speaker identification.
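The MLP and SVM baselines reported above could be reproduced along the following lines. This is a hedged sketch assuming utterance-level feature vectors and speaker-identity labels; the kernel, regularization strength, and layer size are chosen arbitrarily for illustration rather than taken from the paper.

```python
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

def compare_baselines(X_train, y_train, X_test, y_test):
    # Train SVM and MLP on the same features and report test accuracy,
    # mirroring the baseline comparison against the cascaded GMM-DNN.
    results = {}
    for name, clf in [('SVM', SVC(kernel='rbf', C=10.0)),
                      ('MLP', MLPClassifier(hidden_layer_sizes=(128,),
                                            max_iter=500, random_state=0))]:
        clf.fit(X_train, y_train)
        results[name] = accuracy_score(y_test, clf.predict(X_test))
    return results
```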
Introduction

Speaker recognition and its subdivisions, "speaker identification" and "speaker verification," need to be considered with respect to the talking environment: the neutral talking environment versus emotional talking environments. Speaker recognition performance faces drastic challenges, especially when the speaker's identity passes through a human-computer interface in emotional talking environments. Speaker recognition applications in security systems are widening their base into the banking sector, customer care, and criminal investigation, and can serve as a security control measure for remote access to a server or to confidential library files on a server. "Automatic speaker identification and verification" in stressful and emotional talking environments is a challenging area of research [1]. "Speaker identification" comprises two schemes: "closed-set" and "open-set" identification, contrasted in the sketch after this paragraph. In the closed-set scheme, the unknown speaker is presumed to be one of the speakers in a database of known speakers; in the open-set scheme, the unfamiliar speaker is not necessarily from that database. Operationally, speaker identification divides into "text-dependent" identification, where the speaker utters the same text in the training and testing phases, and "text-independent" identification, where the speaker utters different texts during training and testing [2].

Perfect communication from a speaker depends not only on linguistic content but also on the speaker's emotional state. Identifying the emotional state of a speaker by machine remains a challenge for the human-machine interface. Speech is a mix of linguistic content and emotion, carried in part by paralinguistic features. Emotion recognition from ...
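To make the closed-set/open-set distinction concrete, the following hypothetical decision rules assume a vector of per-speaker model scores (e.g. GMM log-likelihoods for each enrolled speaker); the rejection threshold is a placeholder, not a value from the literature.

```python
import numpy as np

def identify_closed_set(scores):
    # Closed set: the test speaker is assumed to be one of the enrolled
    # speakers, so simply pick the best-scoring model.
    return int(np.argmax(scores))

def identify_open_set(scores, threshold=-50.0):
    # Open set: the test speaker may be unknown; reject if even the
    # best model scores below a decision threshold.
    best = int(np.argmax(scores))
    return best if scores[best] >= threshold else None  # None = unknown
```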
This research aims to design and implement an artificial emotional intelligence system capable of identifying the unknown emotion of a speaker. To that end, we propose a novel framework for emotion recognition in the presence of noise and interference. Our approach uses energy, temporal, and spectral parameters to examine the speaker's emotion. Rather than the Gammatone filterbank and short-time Fourier transform (STFT) commonly adopted in the literature, we employ a novel wavelet packet transform (WPT) based cochlear filterbank. Our system, coupling this representation with a random forest classifier, shows superior performance over existing algorithms when evaluated on three distinct speech corpora in two different languages, including stressful and noisy talking conditions.
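A minimal sketch of WPT-based subband features in the spirit of the cochlear filterbank described above might look as follows, using PyWavelets and scikit-learn; the wavelet family, decomposition depth, and random-forest settings are illustrative assumptions, not the paper's design.

```python
import numpy as np
import pywt
from sklearn.ensemble import RandomForestClassifier

def wpt_subband_energies(signal, wavelet='db4', level=4):
    # Full wavelet packet decomposition; terminal nodes in frequency
    # order act as a crude filterbank over the signal's spectrum.
    wp = pywt.WaveletPacket(data=signal, wavelet=wavelet,
                            mode='symmetric', maxlevel=level)
    nodes = wp.get_level(level, order='freq')
    # Log subband energies form the utterance feature vector.
    return np.array([np.log(np.sum(np.square(n.data)) + 1e-10)
                     for n in nodes])

def train_rf(features, labels):
    # Classify utterances from their subband-energy features.
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(features, labels)
    return rf
```

At depth 4 this yields 16 uniform subbands; a cochlear-like (nonuniform, mel-style) tiling would instead prune the packet tree to keep finer resolution at low frequencies.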