The detection of different levels of physical load from speech has many applications: besides telemedicine, the non-contact detection of certain heart-rate ranges can be useful for sports and other leisure-time devices. Available approaches mainly use a large number of spectral and prosodic features. With the typically small data sets in this setting, such as the Talk & Run data set and the Munich Biovoice Corpus, these high-dimensional feature spaces are only sparsely populated. We therefore aim to reduce the number of features using modern neural-network-based representations: bottleneck features, obtained from standard low-level descriptors via a feed-forward neural network, and activation-map features, obtained from spectrograms via a convolutional neural network. We use these features for an SVM classification of high versus low physical load and compare their performance. We also discuss the possibility of transferring the hyperparameters of the extracting networks between data sets. We show that even for limited amounts of data, deep-learning-based methods can bring a substantial improvement over "conventional" features.
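The following is a minimal sketch of the bottleneck-feature idea described in the abstract, assuming pre-computed low-level descriptor (LLD) vectors as input. All layer sizes, the synthetic data, and the training setup are illustrative assumptions, not the configuration used in the paper.

    # Sketch: train a feed-forward net with a narrow middle layer on LLDs,
    # then feed the bottleneck activations to an SVM (hypothetical setup).
    import torch
    import torch.nn as nn
    from sklearn.svm import SVC

    class BottleneckNet(nn.Module):
        """Feed-forward net whose narrow middle layer yields compact features."""
        def __init__(self, n_lld=130, n_bottleneck=16, n_classes=2):
            super().__init__()
            self.encoder = nn.Sequential(
                nn.Linear(n_lld, 64), nn.ReLU(),
                nn.Linear(64, n_bottleneck), nn.ReLU(),  # bottleneck layer
            )
            self.head = nn.Linear(n_bottleneck, n_classes)

        def forward(self, x):
            z = self.encoder(x)        # bottleneck activations = features
            return self.head(z), z

    # Placeholder data: LLD vectors with binary high/low physical-load labels.
    X = torch.randn(200, 130)
    y = torch.randint(0, 2, (200,))

    net = BottleneckNet()
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(50):                # brief illustrative training loop
        opt.zero_grad()
        logits, _ = net(X)
        loss_fn(logits, y).backward()
        opt.step()

    # Use the learned low-dimensional bottleneck features for the SVM.
    with torch.no_grad():
        _, Z = net(X)
    svm = SVC(kernel="rbf").fit(Z.numpy(), y.numpy())

The point of the design is dimensionality reduction: the SVM sees only the 16-dimensional bottleneck activations instead of the full LLD vector, which is what keeps the feature space densely populated on small corpora.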
Typical current assistance systems often take the form of user interfaces optimised to mediate between the user's interests and the capabilities of the system. In contrast, a peer-like system should be capable of independent decision-making, which in turn requires an understanding and knowledge of the current situation to support a sensible decision-making process. We present a method for a system that interacts with its user to optimise its information-gathering task while maintaining the user's satisfaction with the system, so that the user is not discouraged from further interaction. Based on the collected information, the system can then create and employ a specifically adapted rule base, bringing it much closer to an intelligent companion than to a typical technical user interface. A further aspect is the perception of the system as a trustworthy and understandable partner, allowing an empathetic understanding between the user and the system and leading to a more closely integrated smart environment.
Objective: Acoustic addressee detection is a challenge that arises in human group interactions as well as in interactions with technical systems. The research domain is relatively new, and no structured review is available. Due especially to the recent growth in the use of voice assistants, the topic has received increased attention. To allow a natural interaction on the same level as human interactions, many studies have focused on acoustic analyses of speech. The aim of this survey is to give an overview of the different studies and to compare them in terms of utilized features, datasets, and classification architectures, which has not been done before.
Methods: The survey followed the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines. We included all studies that analyzed acoustic characteristics of speech utterances to automatically detect the addressee. For each study, we describe the dataset, feature set, classification architecture, performance, and other relevant findings.
Results: 1,581 studies were screened, of which 23 met the inclusion criteria. The majority of studies utilized German or English speech corpora. Twenty-six percent of the studies were tested on in-house datasets, for which only limited information is available. Nearly 40% of the studies employed hand-crafted feature sets; the remaining studies mostly relied on the Interspeech ComParE 2013 feature set or on Log-FilterBank Energy and Log Energy of Short-Time Fourier Transform features. Twelve of the 23 studies used deep-learning approaches; the other eleven used classical machine-learning methods. Nine of the 23 studies additionally employed classifier fusion.
Conclusion: Speech-based automatic addressee detection is a relatively new research domain. Device-directed speech can be distinguished from non-device-directed speech, especially when vast amounts of material or sophisticated models are used. Furthermore, a clear distinction between in-house and pre-existing datasets can be drawn, and a clear trend toward larger pre-defined feature sets (partly combined with feature-selection methods) is apparent.
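As a brief illustration of the large pre-defined feature sets mentioned in the results, the sketch below extracts openSMILE functionals from an audio file. Note the assumptions: the opensmile Python package ships the ComParE 2016 set rather than the 2013 set named above, so it is substituted here, and "speech.wav" is a placeholder file name.

    # Sketch: extract a large pre-defined functional feature set with openSMILE.
    import opensmile

    smile = opensmile.Smile(
        feature_set=opensmile.FeatureSet.ComParE_2016,   # stand-in for ComParE 2013
        feature_level=opensmile.FeatureLevel.Functionals,
    )
    features = smile.process_file("speech.wav")  # one row of 6,373 functionals
    print(features.shape)

Feature vectors of this size are exactly why several of the surveyed studies combine such sets with feature-selection methods before classification.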