Traditional dialogue systems use a fixed silence threshold to detect the end of users' turns. Such a simplistic model can result in system behaviour that is both interruptive and unresponsive, which in turn affects user experience. Various studies have observed that human interlocutors take cues from speaker behaviour, such as intonation, paralanguage, content, and gestures among others, to coordinate smooth exchange of speaking turns. However, hardly any effort has been made towards implementing these models in dialogue systems and verifying how well they model the turn-taking behaviour in human-computer interactions. In this paper, we present a data-driven approach to building models for online detection of suitable feedback response locations in the user's speech. We first collected human-computer interaction data using a spoken dialogue system that can perform the Map Task with users (albeit using a trick). Using this data we trained various models that use automatically extractable prosodic, contextual and lexico-syntactic features for online detection of feedback response locations. Next, we implemented a trained model in the same dialogue system and evaluated it in interactions with users. The results of a perception test of the user interactions support our hypothesis that the trained model provides for smoother dialogue in contrast to a baseline model. We also found that the trained model enhances the system's responsiveness in contrast to the baseline model. To our knowledge, this is the first work on actual verification of the proposals on using spea ker behavioural cues such as prosody, syntax and context, in modelling human-like turn-taking behaviour in spoken dialogue systems. Our results confirm that a model trained on these speaker modalities offers both smooth turn-transitions and responsive system behaviour.
Movie and TV subtitles contain large amounts of conversational material, but lack an explicit turn structure. This paper present a data-driven approach to the segmentation of subtitles into dialogue turns. Training data is first extracted by aligning subtitles with transcripts in order to obtain speaker labels. This data is then used to build a classifier whose task is to determine whether two consecutive sentences are part of the same dialogue turn. The approach relies on linguistic, visual and timing features extracted from the subtitles themselves and does not require access to the audiovisual material -although speaker diarization can be exploited when audio data is available. The approach also exploits alignments with related subtitles in other languages to further improve the classification performance. The classifier achieves an accuracy of 78 % on a held-out test set. A follow-up annotation experiment demonstrates that this task is also difficult for human annotators.
In this paper, we present a data-driven approach for detecting instances of miscommunication in dialogue system interactions. A range of generic features that are both automatically extractable and manually annotated were used to train two models for online detection and one for offline analysis. Online detection could be used to raise the error awareness of the system, whereas offline detection could be used by a system designer to identify potential flaws in the dialogue design. In experimental evaluations on system logs from three different dialogue systems that vary in their dialogue strategy, the proposed models performed substantially better than the majority class baseline models.
Abstract-We present an approach to enhance the interaction abilities of the Nao humanoid robot by extending its communicative behavior with non-verbal gestures (hand and head movements, and gaze following). A set of nonverbal gestures were identified that Nao could use for enhancing its presentation and turn-management capabilities in conversational interactions. We discuss our approach for modeling and synthesizing gestures on the Nao robot. A scheme for system evaluation that compares the values of users' expectations and actual experiences has been presented. We found that open arm gestures, head movements and gaze following could significantly enhance Nao's ability to be expressive and appear lively, and to engage human users in conversational interactions.
Abstract-The paper presents a multimodal conversational interaction system for the Nao humanoid robot. The system was developed at the 8th International Summer Workshop on Multimodal Interfaces, Metz, 2012. We implemented WikiTalk, an existing spoken dialogue system for open-domain conversations, on Nao. This greatly extended the robot's interaction capabilities by enabling Nao to talk about an unlimited range of topics. In addition to speech interaction, we developed a wide range of multimodal interactive behaviours by the robot, including facetracking, nodding, communicative gesturing, proximity detection and tactile interrupts. We made video recordings of user interactions and used questionnaires to evaluate the system. We further extended the robot's capabilities by linking Nao with Kinect.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.