To construct a naturalistic emotional speech database, a novel paradigm for collecting naturalistic emotional speech during spontaneous Japanese dialog was proposed. The paradigm was assessed by investigating whether the collected speech contains and conveys rich emotions, both psychologically and acoustically. To encourage speakers to experience and express natural, vivid emotions, a Massively Multiplayer Online Role-Playing Game (MMORPG) was adopted as the task. Speakers were asked to play the MMORPG together while discussing strategies for achieving their tasks through a voice chat system. Recording lasted one hour per speaker, for a total of approximately 14 hours. The results of emotional labeling of the collected speech supported the validity of the paradigm, showing interlabeler agreement higher than chance level. In addition, the paradigm was shown to yield more emotional speech than other paradigms: the rate of labeled instances for our speech material was significantly higher (73%, χ²(2) = 27659.87, p < 0.001) than for other speech materials. Finally, an acoustical analysis supported the validity of the paradigm, showing a significant difference between nonemotional and emotional utterances (p < 0.05).
To give synthetic speech richer expression, prosodic features of utterances expressing various kinds of emotion were analyzed. Utterances expressing four basic emotions (joy, sadness, anger, and fear) at several degrees of intensity were collected as speech material. The fundamental frequency contours were analyzed based on a model of their generation process. Changes in the model's control parameters were examined with regard to the degree of emotion. The baseline frequency increases as the degree of emotion increases; this tendency is especially marked for sadness and anger. For phrase commands, the rate of occurrence increases with the degree of emotion. The rates are affected by the kind of branch boundary in the grammatical structure and by the number of morae since the immediately preceding phrase command. The change in phrase command amplitude depends on the kind of position in the grammatical structure. For accent commands, the timings of onsets and offsets are almost constant across degrees of emotion and depend on the accent types of the prosodic words. The magnitude of the accent commands changes as the degree of emotion increases, depending on the position of the prosodic word from the beginning of the utterance.
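The relationship between baseline frequency, phrase commands, and accent commands described above can be illustrated with a minimal sketch. It assumes the generation-process model is of the command-response (Fujisaki) type, which the abstract does not name explicitly. The constants `ALPHA`, `BETA`, and `GAMMA` are commonly cited defaults, and all command timings, amplitudes, and baseline values are hypothetical numbers chosen for illustration, not values from the study.

```python
import math

# Assumed constants (commonly cited defaults, not taken from the abstract).
ALPHA = 3.0   # natural angular frequency of the phrase control mechanism (1/s)
BETA = 20.0   # natural angular frequency of the accent control mechanism (1/s)
GAMMA = 0.9   # ceiling of the accent component


def phrase_response(t, alpha=ALPHA):
    """Impulse response of the phrase control mechanism."""
    if t < 0:
        return 0.0
    return alpha ** 2 * t * math.exp(-alpha * t)


def accent_response(t, beta=BETA, gamma=GAMMA):
    """Step response of the accent control mechanism, clipped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)


def f0_contour(times, baseline_hz, phrase_cmds, accent_cmds):
    """Return F0 in Hz at each time point.

    phrase_cmds: list of (onset_s, magnitude)
    accent_cmds: list of (onset_s, offset_s, amplitude)
    Components are summed in the log-frequency domain on top of the baseline.
    """
    contour = []
    for t in times:
        log_f0 = math.log(baseline_hz)
        for onset, magnitude in phrase_cmds:
            log_f0 += magnitude * phrase_response(t - onset)
        for onset, offset, amplitude in accent_cmds:
            log_f0 += amplitude * (
                accent_response(t - onset) - accent_response(t - offset)
            )
        contour.append(math.exp(log_f0))
    return contour


# Hypothetical commands for one short utterance (illustrative only).
times = [i * 0.01 for i in range(200)]        # 0 to 2 s in 10 ms steps
phrase_cmds = [(0.0, 0.5)]                    # one phrase command at t = 0
accent_cmds = [(0.3, 0.6, 0.4)]               # one accent command, 0.3 s to 0.6 s

neutral = f0_contour(times, 120.0, phrase_cmds, accent_cmds)
emotional = f0_contour(times, 150.0, phrase_cmds, accent_cmds)  # raised baseline
```

Because the components add in the log domain, raising only the baseline frequency scales the whole contour by a constant ratio while leaving the timing of the phrase and accent components unchanged. Changes in phrase command occurrence or accent command magnitude would be modeled by editing the command lists instead.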