One of the obstacles in developing speech emotion recognition (SER) systems is the data scarcity problem, i.e., the lack of labeled data for training these systems. Data augmentation is an effective method for increasing the amount of training data. In this paper, we propose a cycle generative adversarial network (Cycle-GAN) for data augmentation in SER systems. For each of the five emotions considered, an adversarial network is designed to generate data that has a similar distribution to the main data in that class but has a different distribution to those of other classes. These networks are trained in an adversarial way to produce feature vectors similar to those in the training set, which are then added to the original training sets. Instead of using the common cross-entropy loss to train Cycle-GANs, we use the Wasserstein divergence to mitigate the gradient vanishing problem and to generate high-quality samples. The proposed network has been applied to SER using the EMO-DB dataset. The quality of the generated data is evaluated using two classifiers based on Support Vector Machine (SVM) and Deep Neural Network (DNN).
With the growing mechanization of everyday life, speech processing has become crucial for interaction between humans and machines. Deep neural networks require a database with enough data for training, and the more features are extracted from the speech signal, the more samples are needed to train them. Adequate training can be ensured only when sufficient and varied data are available in each class. If there is not enough data, data augmentation methods can be used to obtain a database with enough samples. One of the obstacles to developing speech emotion recognition systems is the data sparsity problem in each class for neural network training. The current study focuses on building a cycle generative adversarial network for data augmentation in a speech emotion recognition system. For each of the five emotions employed, a generative adversarial network is designed to generate data that is very similar to the main data in that class while remaining distinguishable from the emotions of the other classes. These networks are trained adversarially to produce feature vectors resembling each class in the original feature space, and the generated vectors are then added to the training sets in the database to train the classifier network. Instead of the common cross-entropy loss, the Wasserstein divergence is used to train the generative adversarial networks, which addresses the vanishing gradient problem and produces high-quality artificial samples. The suggested network was tested on speech emotion recognition using EMODB for the training, testing, and evaluation sets, and the quality of the artificial data was evaluated using two classifiers, a Support Vector Machine (SVM) and a Deep Neural Network (DNN). Moreover, the results show that, by extracting and reproducing high-level features from acoustic features, speech emotion recognition separating five primary emotions can be performed with acceptable accuracy.
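To make the role of the Wasserstein divergence concrete, the following is a minimal sketch of a critic loss of that form, not the authors' implementation. It assumes a hypothetical linear critic f(x) = w . x + b, chosen only because its gradient with respect to the input is w, so the gradient penalty can be written without an autodiff library. The function names, the sign convention (the critic scores real samples higher), and the defaults k = 2 and p = 6 are assumptions; the abstract does not state which values were used.

```python
from math import sqrt
from statistics import fmean


def linear_critic(w, b, x):
    """Hypothetical critic f(x) = w . x + b (illustrative only)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def wdiv_critic_loss(w, b, real_batch, fake_batch, k=2.0, p=6.0):
    """Loss minimised by the critic:
    -E[f(real)] + E[f(fake)] + k * E[||grad_x f||^p].
    k and p are assumed defaults, not values taken from the source.
    """
    real_score = fmean([linear_critic(w, b, x) for x in real_batch])
    fake_score = fmean([linear_critic(w, b, x) for x in fake_batch])
    # For a linear critic the input gradient is w at every point,
    # so the expectation of the penalty reduces to a single term.
    grad_norm = sqrt(sum(wi * wi for wi in w))
    penalty = k * grad_norm ** p
    return -real_score + fake_score + penalty


def generator_loss(w, b, fake_batch):
    """Loss minimised by the generator: it tries to raise the critic's
    score on generated feature vectors."""
    return -fmean([linear_critic(w, b, x) for x in fake_batch])


# Toy two-dimensional "feature vectors" (made-up numbers).
real = [[2.0, 0.0], [4.0, 0.0]]
fake = [[1.0, 0.0], [1.0, 0.0]]
critic_value = wdiv_critic_loss([1.0, 0.0], 0.0, real, fake)
```

Unlike a cross-entropy discriminator loss, these scores are not passed through a sigmoid, so the generator's gradient does not shrink when the critic separates real and generated samples easily. With a neural-network critic the gradient norm would be computed by automatic differentiation at real, generated, or interpolated samples.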
Thus far, it has been unknown whether feature selection methods succeed in increasing the efficiency of speech emotion recognition systems. This article discusses and evaluates feature selection for data augmentation purposes in a speech emotion recognition system. The experiments were performed in Python on four common databases: EMO-DB, eNTERFACE05, SAVEE, and IEMOCAP. Data analysis was conducted on all four databases for five emotions: sadness, fear, anger, happiness, and neutral. A support vector machine was used to classify emotions. We also used a generative adversarial network to augment data and two feature selection networks, the Fisher and Linear Discriminant Analysis algorithms. In two steps, and with feedback from the classification network, we could bring the speech emotion recognition system to an optimal point in sample number and feature vector dimensions. The results showed that using Linear Discriminant Analysis and the Fisher method simultaneously in the generative adversarial networks can remove redundant and irrelevant features while preserving features with important emotional information for classification. The results obtained from the proposed method were compared with those of recent studies. The proposed method achieved 86.32% accuracy on the Berlin Database of Emotional Speech.
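As an illustration of Fisher-based feature selection, the sketch below ranks feature dimensions by a common form of the Fisher score: between-class scatter divided by within-class scatter, computed per feature. It is a minimal standard-library example under assumed names (`fisher_scores`, `top_k_features`) and toy data, not the pipeline used in the study, which combines the criterion with Linear Discriminant Analysis and a generative adversarial network.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def fisher_scores(samples, labels):
    """Return one Fisher score per feature column.

    score_j = sum_c n_c * (mean_cj - mean_j)^2 / sum_c n_c * var_cj
    """
    by_class = defaultdict(list)
    for row, label in zip(samples, labels):
        by_class[label].append(row)

    scores = []
    for j in range(len(samples[0])):
        overall_mean = fmean([row[j] for row in samples])
        between = 0.0
        within = 0.0
        for rows in by_class.values():
            values = [row[j] for row in rows]
            between += len(values) * (fmean(values) - overall_mean) ** 2
            within += len(values) * pvariance(values)
        if within > 0:
            scores.append(between / within)
        else:
            # No within-class spread: perfectly separating if the class
            # means differ, uninformative if the feature is constant.
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


def top_k_features(samples, labels, k):
    """Indices of the k features with the highest Fisher score."""
    scores = fisher_scores(samples, labels)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return ranked[:k]


# Toy data (made up): feature 0 separates the two emotions, feature 1 does not.
X = [[0.0, 5.0], [0.2, 1.0], [1.0, 4.0], [1.2, 2.0]]
y = ["sadness", "sadness", "anger", "anger"]
selected = top_k_features(X, y, k=1)
```

Features whose class means coincide score near zero and are dropped first, which matches the stated goal of removing redundant and irrelevant features while keeping those that carry emotional information.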