Can vocal emotions be emulated? This question has been a recurrent concern of the speech community and has been vigorously investigated, fueled further by its link to the validity of acted emotion databases. Much of the speech and vocal emotion research has relied on acted emotion databases as valid proxies for studying natural emotions. To create models that generalize to natural settings, it is crucial to work with valid prototypes, ones that can be assumed to reliably represent natural emotions. More concretely, it is important to study emulated emotions against natural emotions in terms of their physiological and psychological concomitants. In this paper, we present an at-scale systematic study of the differences between natural and acted vocal emotions. We use a self-attention based emotion classification model to understand the phonetic bases of emotions by discovering the most 'attentive' phonemes for each class of emotions. We then compare these attentive phonemes in their importance and distribution across acted and natural classes. Our results show significant differences in the manner and choice of phonemes between acted and natural speech, suggesting moderate to low validity and value of acted speech databases for emotion classification tasks.
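The abstract does not specify the model's internals, but the general idea of an attention-pooled emotion classifier whose attention weights can be aggregated per phoneme is straightforward to sketch. The following is a minimal illustration only, with assumed feature dimensions, layer sizes, and class counts; it is not the authors' actual architecture.

```python
# Minimal sketch (not the paper's exact model): an attention-pooled emotion
# classifier over frame-level features. The per-frame attention weights can
# later be aggregated by phoneme alignment to rank the most 'attentive'
# phonemes per emotion class, separately for acted and natural speech.
import torch
import torch.nn as nn

class AttentivePhonemeClassifier(nn.Module):
    def __init__(self, feat_dim=40, hidden=128, n_emotions=4):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)           # scalar attention score per frame
        self.classifier = nn.Linear(2 * hidden, n_emotions)

    def forward(self, x):                              # x: (batch, frames, feat_dim)
        h, _ = self.encoder(x)                         # (batch, frames, 2*hidden)
        w = torch.softmax(self.attn(h).squeeze(-1), dim=1)    # (batch, frames)
        pooled = torch.bmm(w.unsqueeze(1), h).squeeze(1)      # attention-weighted pooling
        return self.classifier(pooled), w              # emotion logits and attention weights

# Dummy usage with random features; real phoneme alignments would be needed
# to map attention weights back to phonemes.
model = AttentivePhonemeClassifier()
feats = torch.randn(2, 300, 40)
logits, attn = model(feats)
```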
Traditionally, in paralinguistic analysis for emotion detection from speech, emotions have been identified with discrete or dimensional (continuous-valued) labels. Accordingly, models proposed for emotion detection use one or the other of these label types. However, psychologists such as Russell and Plutchik have proposed theories and models that unite these views, maintaining that the two representations carry shared and complementary information. This paper is an attempt to validate these viewpoints computationally. To this end, we propose a model to jointly predict continuous and discrete emotional attributes and show how the relationship between them can be utilized to improve the robustness and performance of emotion recognition tasks. Our approach comprises multi-task and hierarchical multi-task learning frameworks that jointly model the relationships between continuous-valued and discrete emotion labels. Experimental results on two widely used datasets for speech-based emotion recognition (IEMOCAP and MSP-Podcast) show that our model yields statistically significant improvements in performance over strong baselines with non-unified approaches. We also demonstrate that using one type of label (discrete or continuous-valued) for training improves recognition performance on tasks that use the other type of label. Experimental results and reasoning for this approach (termed the mismatched training approach) are also presented.
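The hierarchical multi-task idea can be illustrated with a small sketch under assumed dimensions: a shared encoder feeds a regression head for continuous attributes (e.g., valence, arousal, dominance), whose outputs condition the discrete-class head, and the two losses are combined. This is an illustrative construction, not the paper's reported architecture or loss weighting.

```python
# Minimal sketch (assumed shapes and heads): hierarchical multi-task model
# that predicts continuous emotional attributes and reuses them when
# classifying discrete emotion categories, so the two label types share
# information during training.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmotionModel(nn.Module):
    def __init__(self, feat_dim=40, hidden=128, n_classes=4, n_dims=3):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.dim_head = nn.Linear(hidden, n_dims)               # continuous attributes
        self.cls_head = nn.Linear(hidden + n_dims, n_classes)   # discrete classes, conditioned on attributes

    def forward(self, x):                                       # x: (batch, feat_dim) utterance embedding
        z = self.shared(x)
        dims = self.dim_head(z)
        logits = self.cls_head(torch.cat([z, dims], dim=-1))
        return dims, logits

# Joint objective: weighted sum of a regression loss on the continuous
# attributes and a classification loss on the discrete labels (dummy targets here).
model = JointEmotionModel()
x = torch.randn(8, 40)
dims, logits = model(x)
loss = F.mse_loss(dims, torch.zeros_like(dims)) + \
       F.cross_entropy(logits, torch.zeros(8, dtype=torch.long))
```

Mismatched training, as described in the abstract, would correspond to optimizing one head's loss while evaluating on the other label type; the shared encoder is what allows that transfer.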