More than a decade has passed since research on the automatic recognition of emotion from speech became a field of its own, in line with its 'big brothers' speech and speaker recognition. This article attempts to provide a short overview of where we are today, how we got there, and what this can tell us about where to go next and how we could arrive there. In the first part, we address the basic phenomenon, reflecting on the last fifteen years and commenting on databases, modelling and annotation, the unit of analysis, and prototypicality. We then shift to automatic processing, including discussions on features, classification, robustness, evaluation, and implementation and system integration. From there we go to the first comparative challenge on emotion recognition from speech, the INTERSPEECH 2009 Emotion Challenge, organised by (some of) the authors, covering the Challenge's database, Sub-Challenges, participants and their approaches, the winners, and the fusion of results, up to the lessons actually learnt, before we finally address the persistent problems and promising future directions.

Keywords: emotion, affect, automatic classification, feature types, feature selection, noise robustness, adaptation, standardisation, usability, evaluation

Setting the Scene

This special issue addresses new approaches towards dealing with the processing of realistic emotions in speech, and this overview article gives an account of the state of the art, of the lacunae in this field, and of promising approaches towards overcoming shortcomings in modelling and recognising realistic emotions. We also report on the first Emotion Challenge at INTERSPEECH 2009, which constituted the initial impetus of this special issue; to end with, we sketch future strategies and applications, trying to answer the question 'Where to go from here?'

The article is structured as follows: we first deal with the basic phenomenon, briefly reflecting on the last fifteen years and commenting on databases, modelling and annotation, the unit of analysis, and prototypicality. We then proceed to automatic processing (sec. 2), including discussions on features, classification, robustness, evaluation, and implementation and system integration. From there we go to the first Emotion Challenge (sec. 3), including the description of the Challenge's database, Sub-Challenges, participants and their approaches, the winners, and the fusion of results, up to the lessons learnt, before concluding this article (sec. 4).
As automatic emotion recognition based on speech matures, new challenges can be faced. We therefore address the major aspects in view of potential applications in the field, in order to benchmark today's emotion recognition systems and bridge the gap between commercial interest and current performance: acted vs. spontaneous speech, realistic emotions, noise and microphone conditions, and speaker independence. Three different data sets are used: the Berlin Emotional Speech Database, the Danish Emotional Speech Database, and the spontaneous AIBO Emotion Corpus. By using different feature types such as word- or turn-based statistics, manual versus forced alignment, and optimization techniques, we show how to best cope with this demanding task and how noise addition or different microphone positions affect emotion recognition.
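To make the notion of turn-based statistics concrete, the following minimal sketch summarises frame-level contours such as pitch and energy into turn-level statistics via a small set of functionals. The descriptor names, the functional set, and the dummy contours are assumptions for illustration only; the systems benchmarked above use far richer feature sets.

```python
# Minimal sketch: turn-level statistical functionals over frame-level
# low-level descriptors (LLDs). Names and functionals are illustrative.
import numpy as np


def turn_functionals(lld: dict) -> dict:
    """Map each frame-level contour of one turn to turn-level statistics."""
    feats = {}
    for name, contour in lld.items():
        contour = np.asarray(contour, dtype=float)
        feats[f"{name}_mean"] = float(np.mean(contour))
        feats[f"{name}_std"] = float(np.std(contour))
        feats[f"{name}_min"] = float(np.min(contour))
        feats[f"{name}_max"] = float(np.max(contour))
        feats[f"{name}_range"] = float(np.ptp(contour))
    return feats


# Usage with dummy contours for one turn (100 frames, e.g. at a 10 ms hop):
turn = {"f0_hz": np.abs(np.random.randn(100)) * 40 + 180,
        "energy": np.abs(np.random.randn(100))}
print(turn_functionals(turn))
```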
In this article, we describe and interpret a set of acoustic and linguistic features that characterise emotional/emotion-related user states, confined to the one database processed: four classes in a German corpus of children interacting with a pet robot. To this end, we collected a very large feature vector consisting of more than 4,000 features extracted at different sites. We performed extensive feature selection (Sequential Forward Floating Search) for seven acoustic and four linguistic feature types, ending up with a small number of 'most important' features, which we try to interpret by discussing the impact of different feature and extraction types. We establish different measures of impact and discuss the mutual influence of acoustics and linguistics.
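The following is a minimal sketch of Sequential Forward Floating Search (SFFS), the selection scheme named above. The scoring criterion is an assumption: here, cross-validated accuracy of a simple scikit-learn classifier stands in for whatever wrapper criterion the original study used.

```python
# Minimal SFFS sketch: greedy forward selection with conditional backward
# ("floating") removal whenever dropping a feature beats the best score
# previously recorded for the smaller subset size.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB


def score(X, y, subset):
    """Assumed criterion: 5-fold CV accuracy on the chosen feature columns."""
    return cross_val_score(GaussianNB(), X[:, sorted(subset)], y, cv=5).mean()


def sffs(X, y, target_size):
    selected = []
    best_at_size = {}                       # best score seen per subset size
    while len(selected) < target_size:
        # Forward step: add the single feature that helps most.
        remaining = [f for f in range(X.shape[1]) if f not in selected]
        _, best_f = max((score(X, y, selected + [f]), f) for f in remaining)
        selected.append(best_f)
        best_at_size[len(selected)] = score(X, y, selected)
        # Floating backward step: remove features while that improves on the
        # best score recorded for the smaller subset size.
        while len(selected) > 2:
            drop_score, drop_f = max(
                (score(X, y, [g for g in selected if g != f]), f) for f in selected)
            if drop_score > best_at_size.get(len(selected) - 1, -np.inf):
                selected.remove(drop_f)
                best_at_size[len(selected)] = drop_score
            else:
                break
    return selected
```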
In this chapter, we focus on the automatic recognition of emotional states using acoustic and linguistic parameters as features, and classifiers as tools to predict the 'correct' emotional states. We first sketch the history and state of the art in this field; then we describe the process of 'corpus engineering', i.e. the design and recording of databases, the annotation of emotional states, and further processing such as manual or automatic segmentation. Next we present an overview of acous-
This paper investigates the automatic recognition of emotion from spoken words by vector space modeling vs. string kernels, which have not yet been investigated in this respect. Apart from the spoken content directly, we integrate Part-of-Speech and higher semantic tagging in our analyses. As opposed to most works in the field, we evaluate the performance with an ASR engine in the loop. Extensive experiments are run on the FAU Aibo Emotion Corpus of 4k spontaneous emotional child-robot interactions and show surprisingly low performance degradation with real ASR compared to transcription-based emotion recognition. As a result, bag-of-words modeling dominates over all other modeling forms based directly on the spoken content.
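As a sketch of the bag-of-words (vector space) modeling of spoken content reported as dominant above, the snippet below builds term-count vectors over (ASR or transcribed) words and feeds them to a linear classifier. The toy utterances and labels are invented for illustration; the actual experiments use the FAU Aibo Emotion Corpus.

```python
# Minimal bag-of-words sketch for emotion classification from spoken words.
# Utterances and labels are toy data, not corpus material.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

utterances = ["Aibo good boy well done", "no Aibo stop that",
              "turn left Aibo", "Aibo sit down now"]
labels = ["positive", "negative", "neutral", "neutral"]

# Term counts over the word sequence, classified with a linear SVM.
model = make_pipeline(CountVectorizer(), LinearSVC())
model.fit(utterances, labels)
print(model.predict(["good Aibo"]))
```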