Response Timing Detection Using Prosodic and Linguistic Information for Human-friendly Spoken Dialog Systems

Kitaoka, Norihide; Takeuchi, Masashi; Nishimura, Ryota; Nakagawa, Seiichi

doi:10.1527/tjsai.20.220

Cited by 36 publications

(36 citation statements)

References 12 publications


“…Kitaoka et al used first-order regression coefficients of pitch and power contours to describe patterns and generate response timing [29]. Nishimura et al pointed out that both the last short regions and the longer ones contained information which triggered backchannel responses [30].…”

Section: Prosodic Features (mentioning)

confidence: 99%

Backchannel Prediction for Mandarin Human-Computer Interaction

Mao, Peng, Xue, et al. 2015

IEICE Trans. Inf. & Syst.


SUMMARY: In recent years, researchers have tried to create unhindered human-computer interaction by giving virtual agents human-like conversational skills. Predicting backchannel feedback for agent listeners has become a novel research hot-spot. The main goal of this paper is to identify appropriate features and methods for backchannel prediction in Mandarin conversations. First, multimodal Mandarin conversations are recorded for the analysis of backchannel behaviors. To eliminate individual differences in the original face-to-face conversations, more backchannels from different listeners are gathered together. These data confirm that backchannels occurring in the speakers' pauses form a vast majority in Mandarin conversations. Both prosodic and visual features are used in backchannel prediction. Four types of models based on the speakers' pauses are built by using support vector machine classifiers. An evaluation of the pause-based prediction model has shown relatively high accuracy in consideration of the optional nature of backchannel feedback. Finally, the results of the subjective evaluation validate that conversations between humans and virtual listeners using backchannels predicted by the proposed models are more unhindered than those using other backchannel prediction methods. Key words: human-computer interaction, virtual agent, backchannel, Mandarin, support vector machine
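The citation statement for this entry mentions first-order regression coefficients of pitch and power contours as prosodic features. As a rough illustration only, the sketch below computes such a coefficient as the least-squares slope of a contour over time. The function name `regression_slope`, the 10 ms frame shift, and the sample values are assumptions made for this example; they are not taken from Kitaoka et al. or from the citing paper.

```python
from typing import Sequence


def regression_slope(values: Sequence[float], frame_shift: float = 0.01) -> float:
    """Least-squares slope of a contour against time (units per second).

    `frame_shift` is the assumed spacing between frames in seconds.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two frames to fit a slope")
    times = [i * frame_shift for i in range(n)]
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    # Slope = covariance(time, value) / variance(time)
    num = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    den = sum((t - mean_t) ** 2 for t in times)
    return num / den


# Hypothetical log-pitch and log-power values for the last frames before a pause.
pitch = [5.30, 5.28, 5.25, 5.21, 5.18]
power = [4.0, 3.8, 3.5, 3.1, 2.6]

# Two-dimensional feature vector: (pitch slope, power slope).
features = (regression_slope(pitch), regression_slope(power))
```

A negative slope in both components would describe a falling pitch and falling power before the pause. How such features are windowed and fed to a classifier is left open here, since the page does not describe it.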


“…We exploited five features that may be effective for our task by referring to previous work on turn-taking decision [10]-[12]. There might be other features that are effective, but exploring such features is among the future work.…”

Section: Features (mentioning)

confidence: 99%

“…Ohsuga et al identified prosodic features that are helpful for determining ends of turns with decision tree learning on the Japanese Map Task Corpus [11]. Kitaoka et al also used both prosodic and linguistic information to determine timing of system response generation [12]. Edlund et al developed a prosodic analysis tool to augment end-point detection [13].…”

Section: Related Work (mentioning)

confidence: 99%

Posteriori Restoration of Turn-Taking and ASR Results for Incorrectly Segmented Utterances

Komatani, Hotta, Sato, et al. 2015

IEICE Trans. Inf. & Syst.


SUMMARY: In spoken dialogue systems, appropriate turn-taking is as important as generating correct responses. Especially when the dialogue features quick responses, voice activity detection (VAD) often segments a user utterance incorrectly because of short pauses within it. Incorrectly segmented utterances cause problems in both the automatic speech recognition (ASR) results and turn-taking: an incorrect VAD result leads to ASR errors and causes the system to start responding while the user is still speaking. We develop a method that performs a posteriori restoration for incorrectly segmented utterances and implement it as a plug-in for the MMDAgent open-source software. A crucial part of the method is to classify whether restoration is required. We cast it as a binary classification problem of detecting originally single utterances from pairs of utterance fragments. Various features are used, representing timing, prosody, and ASR result information. Experiments show that the proposed method outperformed a baseline with manually-selected features by 4.8% and 3.9% in cross-domain evaluations with two domains. More detailed analysis revealed that the dominant and domain-independent features were utterance intervals and results from the Gaussian mixture model (GMM). Key words: spoken dialogue system, VAD error, turn taking, a posteriori restoration


“…Switching pauses [4], which are defined as pauses between turns, have been regarded as a distinctive property of spoken dialogue as a form of social interaction [5,6]. Unlike pauses in monologues or intrapersonal pauses in dialogues, the duration of switching pauses has an aspect similar to that of reaction time.…”

Section: Introduction (mentioning)

confidence: 99%

“…Although some previous studies [11,12] reported the effects of emotional state on the duration of intrautterance pauses, they did not deal with switching pauses. Furthermore, because the generation of response timing has been treated as an independent module from the speech synthesizer in most spoken dialogue systems [6,13], the finely tuned modeling of switching pause duration taking paralinguistic effects into account has specific importance in the design of responsive human interfaces.…”

Section: Introduction (mentioning)

confidence: 99%

An analysis of switching pause duration as a paralinguistic feature in expressive dialogues

Mori

2009

Acoust. Sci. & Tech.

