Speech Emotion Recognition (SER) is an important and challenging task for human-computer interaction. In the literature, deep learning architectures have been shown to yield state-of-the-art performance on this task when the model is trained and evaluated on the same corpus. However, prior work has indicated that such systems often perform poorly on unseen data. One approach to improving the generalisation capabilities of emotion recognition systems is cross-corpus training, which consists of training the model on an aggregation of different corpora. In this paper we present an analysis of the generalisation capability of deep learning models using cross-corpus training with six different speech emotion corpora. We evaluate the models on an unseen corpus and analyse the learned representations using the t-SNE algorithm, showing that architectures based on recurrent neural networks are prone to overfit the corpora present in the training set, while architectures based on convolutional neural networks (CNNs) generalise better. These findings indicate that (1) cross-corpus training is a promising approach for improving generalisation and (2) CNNs should be the architecture of choice for this approach.
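The kind of representation analysis mentioned above can be illustrated with a minimal sketch: project utterance embeddings taken from a trained model with t-SNE and colour them by source corpus, so that clustering by corpus rather than by emotion suggests corpus overfitting. The array names, shapes, and randomly generated embeddings below are purely illustrative placeholders, not the paper's actual data or model.

# Illustrative sketch: t-SNE projection of learned utterance embeddings,
# coloured by source corpus to inspect corpus-level clustering.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Hypothetical inputs: one embedding per utterance from a trained SER model,
# plus the corpus each utterance came from (shapes are placeholders).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(600, 128))    # (n_utterances, embedding_dim)
corpus_ids = rng.integers(0, 6, size=600)   # six training corpora

# 2-D t-SNE projection of the embedding space.
projection = TSNE(n_components=2, perplexity=30, init="pca",
                  random_state=0).fit_transform(embeddings)

# Scatter plot with one colour per corpus; tight per-corpus clusters would
# indicate that the representation encodes corpus identity.
for c in np.unique(corpus_ids):
    mask = corpus_ids == c
    plt.scatter(projection[mask, 0], projection[mask, 1], s=8, label=f"corpus {c}")
plt.legend()
plt.title("t-SNE of learned representations, coloured by corpus")
plt.show()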