Language is an important tool in speech communication. Even without understanding a language, we can still judge the expressive content of a voice, such as happiness or sadness. However, misunderstandings in emotional communication sometimes occur. It is not clear which common or different features help or hinder people with different cultural/native-language backgrounds in making judgments about the expressivity of speech. To explore this question, we focus on Japanese and Taiwanese listeners' perception of Japanese expressive speech utterances. We used the perceptual model proposed by [Huang and Akagi, InterSpeech 2005; 2007], which involves a concept called "semantic primitives": adjectives for describing voice perception. This concept simplifies and clarifies the discussion of common/different features in terms of acoustic cues and expressive speech perception categories. Using this model, we found some common/different features in the perception of expressive speech. Specifically, our results suggest that there may be primary and secondary semantic primitives, associated with acoustic speech characteristics, that are involved in the perception of expressive speech, and that people from different cultural/native-language backgrounds tend to use the same primary semantic primitives in perceiving expressive speech but different secondary ones.
Abstract. This paper reports rules for morphing a voice so that it is perceived as containing various primitive features, for example, to make it sound more "bright" or "dark". In previous work, we proposed a three-layered model for the perception of emotional speech, consisting of emotional speech, primitive features, and acoustic features. Through experiments and acoustic analysis, we built the relationships between the three layers and reported that these relationships are significant. A bottom-up method was then adopted to verify the relationships. That is, we morphed (resynthesized) a speech voice by composing acoustic features in the bottommost layer to produce a voice in which listeners could perceive one or more primitive features, which could in turn be perceived as different categories of emotion. The intermediate results show that the relationships of the model built in previous work are valid.