The possibility of cultural differences in the fundamental acoustic patterns used to express emotion through the voice is an unanswered question central to the larger debate about the universality versus cultural specificity of emotion. This study used emotionally inflected standard-content speech segments expressing 11 emotions produced by 100 professional actors from 5 English-speaking cultures. Machine learning simulations were employed to classify expressions based on their acoustic features, using conditions where training and testing were conducted on stimuli coming from either the same or different cultures. A wide range of emotions were classified with above-chance accuracy in cross-cultural conditions, suggesting vocal expressions share important characteristics across cultures. However, classification showed an in-group advantage with higher accuracy in within- versus cross-cultural conditions. This finding demonstrates cultural differences in expressive vocal style, and supports the dialect theory of emotions according to which greater recognition of expressions from in-group members results from greater familiarity with culturally specific expressive styles.
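The within- versus cross-cultural evaluation design described above can be sketched as follows. This is a minimal illustration, not the study's pipeline: the abstract does not name the classifier or the acoustic features, so a nearest-centroid classifier stands in here, and the names `fit_centroids`, `predict`, `accuracy`, and `culture_matrix`, as well as the `"train"`/`"test"` data layout, are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

# One sample = (acoustic feature vector, emotion label). Hypothetical layout.
Sample = Tuple[Sequence[float], str]


def fit_centroids(train: List[Sample]) -> Dict[str, List[float]]:
    """Average the feature vectors of each emotion label."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for features, label in train:
        if label not in sums:
            sums[label] = [0.0] * len(features)
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total]
            for label, total in sums.items()}


def predict(centroids: Dict[str, List[float]], features: Sequence[float]) -> str:
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(centroid: List[float]) -> float:
        return sum((c - f) ** 2 for c, f in zip(centroid, features))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(train: List[Sample], test: List[Sample]) -> float:
    """Train on one set of samples and score on another."""
    centroids = fit_centroids(train)
    hits = sum(predict(centroids, x) == y for x, y in test)
    return hits / len(test)


def culture_matrix(
    data: Dict[str, Dict[str, List[Sample]]]
) -> Dict[Tuple[str, str], float]:
    """Accuracy for every (training culture, testing culture) pair.

    Entries with equal cultures are the within-cultural conditions;
    all other entries are cross-cultural conditions.
    """
    return {(a, b): accuracy(data[a]["train"], data[b]["test"])
            for a in data for b in data}
```

In this layout, an in-group advantage would appear as higher values on the diagonal of the matrix (same culture for training and testing) than off the diagonal.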
Segmentation of speech signals is a crucial task in many types of speech analysis. We present a novel approach to segmentation at the syllable level, using a bidirectional long short-term memory (BLSTM) neural network. The network estimates syllable nucleus positions by regressing perceptually motivated input features onto a smooth target function. Peak selection is then performed to obtain valid nucleus positions. Performance of the model is evaluated at the level of both syllables and the vowel segments making up the syllable nuclei. The general applicability of the approach is illustrated by good results on two common databases (Switchboard and TIMIT), covering both read and spontaneous speech, and by a favourable comparison with other published results.
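The peak selection step can be illustrated with a short sketch. The abstract does not specify the selection rule, so the one below is an assumption: keep local maxima of the regression output that reach a threshold, and where two peaks lie closer than a minimum distance, keep the higher one. The function name `pick_nuclei` and the parameters `threshold` and `min_distance` (in frames) are hypothetical.

```python
from typing import List, Sequence


def pick_nuclei(curve: Sequence[float],
                threshold: float = 0.5,
                min_distance: int = 3) -> List[int]:
    """Return frame indices of candidate syllable nuclei.

    `curve` is the per-frame regression output (the smooth target
    function as predicted by the network).
    """
    # Step 1: local maxima that reach the threshold.
    candidates = [
        i for i in range(1, len(curve) - 1)
        if curve[i] >= threshold
        and curve[i] > curve[i - 1]
        and curve[i] >= curve[i + 1]
    ]
    # Step 2: visit candidates from highest to lowest and drop any
    # that fall within `min_distance` frames of an accepted peak.
    selected: List[int] = []
    for i in sorted(candidates, key=lambda k: curve[k], reverse=True):
        if all(abs(i - j) >= min_distance for j in selected):
            selected.append(i)
    return sorted(selected)


if __name__ == "__main__":
    output = [0.0, 0.2, 0.9, 0.3, 0.8, 0.1, 0.0, 0.4, 0.95, 0.5, 0.1]
    print(pick_nuclei(output))  # [2, 8]: the peak at frame 4 is too close to frame 2
```

Converting the returned frame indices to times requires the frame rate of the feature extraction, which the abstract does not state.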
This paper presents our progress in developing a Virtual Human capable of being an attentive speaker. Such a Virtual Human should be able to attend to its interaction partner while it is speaking and modify its communicative behavior on the fly based on what it observes in the behavior of its partner. We report new developments concerning a number of aspects, such as scheduling and interrupting multimodal behavior, automatic classification of listener responses, generation of response-eliciting behavior, and strategies for generating appropriate reactions to listener responses. On the basis of this progress, a task-based setup for a responsive Virtual Human was implemented to carry out two user studies, the results of which are presented and discussed in this paper. This paper is based upon a project report of the eNTERFACE'10 Summer Workshop on Multimodal Interfaces [42].