In recent years, automatic recognition of human emotion from speech has become one of the most important research areas, as it can improve man-machine interaction. In this study, we propose new features derived from the reconstructed phase space (RPS) of speech. To this end, the RPS is uniformly divided into non-overlapping discrete cells and the number of points falling in each cell is counted to form the proposed feature vector. Multiple classifiers were then examined to classify speech samples according to their emotional states. Our experimental results demonstrate the potential of the proposed RPS-based features as a useful complement to standard prosodic and spectral features. The best average recognition rate of 89.34% was obtained for classifying seven emotion categories in the Berlin database using a support vector machine with both radial basis function and polynomial kernels.

Keywords: speech emotion recognition; reconstructed phase space.
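As a rough illustration of the feature construction described in the abstract, the sketch below reconstructs a phase space by time-delay embedding, partitions it with a uniform grid and counts the trajectory points per cell. The embedding dimension, time delay and grid resolution shown here are illustrative assumptions, not the parameter values used in the study.

# Sketch of the RPS cell-count feature; parameter values are assumptions.
import numpy as np

def reconstruct_phase_space(x, dim=2, delay=1):
    """Time-delay embedding: each row is one point of the RPS trajectory."""
    n_points = len(x) - (dim - 1) * delay
    return np.column_stack([x[i * delay : i * delay + n_points] for i in range(dim)])

def rps_cell_counts(x, dim=2, delay=1, cells_per_axis=10):
    """Divide the RPS uniformly into non-overlapping cells and count the
    number of trajectory points falling in each cell."""
    rps = reconstruct_phase_space(x, dim, delay)
    # Uniform grid spanning the occupied region of the phase space.
    edges = [np.linspace(rps[:, d].min(), rps[:, d].max(), cells_per_axis + 1)
             for d in range(dim)]
    counts, _ = np.histogramdd(rps, bins=edges)
    return counts.ravel()  # flattened cell counts form the feature vector

# Example on a synthetic frame (a noisy sinusoid standing in for speech).
frame = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 400)) + 0.05 * np.random.randn(400)
features = rps_cell_counts(frame, dim=2, delay=3, cells_per_axis=8)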
INTRODUCTION

Speaking is the fastest and one of the most important means of human communication. Nowadays, automatic speech recognition (ASR) systems are extensively used in man-machine interaction. However, human speech is generally embedded with emotions that convey the intended message. From this fact arises a new multidisciplinary research area known as speech emotion recognition (SER). Despite widespread efforts in SER, many challenging problems remain to be solved in order to improve the performance of SER systems [1]. Given contradictory reports on the effect of emotions on some acoustic attributes, specifying effective features is still the major unsolved problem.

To address this problem, many prosodic and spectral features have been proposed. Prosodic features appear when sounds are put together in connected speech and mainly deal with the intonation, stress and rhythm of speech. It has been shown that prosodic features, which are widely used in SER, carry important emotional cues about the speaker [1][2][3][4][5][6][7]. In the literature, pitch, duration, energy and their derivatives are widely used to represent prosodic features [1,8]. Features extracted from the spectrum of speech are generally called spectral features. They convey the frequency content of the signal and provide information complementary to prosodic features. Both prosodic and spectral features are generally computed with the traditional linear source-filter model of the human speech production system [19]. Unfortunately, such a model cannot capture the nonlinear 3D fluid-dynamic phenomena of speech [20,21]. To fill the gap between this idealized linear deterministic model and the strongly unpredictable real speech production process, nonlinear processing techniques can be used [22,23,24,25].

In recent years, the reconstructed phase space (RPS) of speech has been used for speech recognition [26,27], speech enhancement [26,27] and sleepiness detection [30]. Moreover, nonlinear dynamics features extracted from the RPS of speech have b...
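For concreteness, the following sketch computes the kind of frame-level pitch and energy contours from which the prosodic features mentioned above are typically derived. The frame length, hop size and autocorrelation-based pitch estimator are illustrative assumptions and are not taken from the paper.

# Minimal frame-level prosodic descriptors (energy and a crude pitch estimate).
import numpy as np

def frame_energy(frame):
    return float(np.sum(frame ** 2))

def frame_pitch(frame, fs, fmin=75.0, fmax=400.0):
    """Crude pitch estimate: lag of the autocorrelation peak inside a
    plausible F0 range; returns 0.0 for (near-)unvoiced frames."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = max(int(fs / fmax), 1), int(fs / fmin)
    if hi >= len(ac) or ac[0] <= 0:
        return 0.0
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag if ac[lag] > 0.3 * ac[0] else 0.0

def prosodic_contours(x, fs, frame_len=0.025, hop=0.010):
    """Pitch and energy contours; statistics and derivatives of such
    contours are typical prosodic features."""
    n, h = int(frame_len * fs), int(hop * fs)
    frames = [x[i:i + n] for i in range(0, len(x) - n, h)]
    return (np.array([frame_pitch(f, fs) for f in frames]),
            np.array([frame_energy(f) for f in frames]))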