Speech carries information not only about the lexical content, but also about the age, gender, identity and emotional state of the speaker. Speech produced in different emotional states is accompanied by distinct changes in the production mechanism. In this chapter, we present a review of analysis methods for emotional speech. In particular, we focus on issues in data collection, feature representation and the development of automatic emotion recognition systems. The significance of the excitation source component of speech production in emotional states is examined in detail. The derived excitation source features are shown to carry correlates of emotion.
Introduction

Humans have evolved various forms of communication, such as facial expressions, gestures, body postures and speech. The form of communication depends on the context of interaction, and is often accompanied by various physiological reactions such as changes in heart rate, skin resistance, temperature, muscle activity and blood pressure. All forms of human communication carry information at two levels: the message and the underlying emotional state. Emotions are an essential part of real-life communication among human beings. Various descriptions of the term emotion are studied in [21,22,60,88,92,98,100]. One such description is: (a) "Emotions are underlying states which are evolved and adaptive. Emotion expressions are produced by the communicative value of underlying states" [22].
In the generation of emotional speech, the speech production features deviate from those of neutral (non-emotional) speech. The objective of this study is to capture the deviations in features related to the excitation component of speech, and to develop a system for automatic recognition of emotions based on these deviations. The emotions considered in this study are anger, happiness, sadness and the neutral state. The study shows that the deviations in the excitation features carry useful emotion-discriminating information, which can be exploited to develop an emotion recognition system. The excitation features used in this study are the instantaneous fundamental frequency (F0), the strength of excitation, the energy of excitation and the ratio of high-frequency to low-frequency band energy (β). A hierarchical binary decision tree approach is used to develop an emotion recognition system with neutral speech as the reference. The recognition experiments show that the excitation features perform comparably to or better than existing prosody and spectral features, such as mel-frequency cepstral coefficients, perceptual linear predictive coefficients and modulation spectral features.
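Of the excitation features listed above, the band-energy ratio β is the simplest to illustrate. The sketch below computes it for a single short-time frame using a power spectrum; the 1 kHz split frequency and the frame length are illustrative assumptions, not necessarily the values used in this study.

```python
# Minimal sketch: high- to low-frequency band-energy ratio (beta) of a frame.
# ASSUMPTIONS: 1 kHz band split and 25 ms frame are illustrative choices only.
import numpy as np

def band_energy_ratio(frame, fs, split_hz=1000.0):
    """Ratio of spectral energy above split_hz to energy below it."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2       # frame power spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / fs)  # frequency of each bin
    low = spectrum[freqs < split_hz].sum()
    high = spectrum[freqs >= split_hz].sum()
    return high / (low + 1e-12)                      # guard against silence

# Example: a 25 ms frame of a 200 Hz tone has nearly all its energy below
# 1 kHz, so beta is close to zero; emotional speech with stronger
# high-frequency excitation would yield a larger ratio.
fs = 16000
t = np.arange(int(0.025 * fs)) / fs
frame = np.sin(2 * np.pi * 200.0 * t)
beta = band_energy_ratio(frame, fs)
print(beta)
```

In practice such a ratio would be computed per frame over voiced regions and its deviation from neutral-speech statistics used as one input to the hierarchical classifier.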