The standard referring-expression generation task involves creating stand-alone descriptions intended solely to distinguish a target object from its context. However, when an artificial system refers to objects in the course of interactive, embodied dialogue with a human partner, the setting is very different: references in situated dialogue can draw on aspects of the physical, interactive, and task-level context, and are therefore unlike those found in corpora of stand-alone references. Moreover, the dominant method of evaluating generated references involves measuring corpus similarity; in an interactive context, other extrinsic measures such as task success and user preference are more relevant, and numerous studies have found little or no correlation between such extrinsic metrics and the predictions of commonly used corpus-similarity metrics. To explore these issues, we introduce a humanoid robot designed to co-operate with a human partner on a joint construction task. We then describe the context-sensitive reference-generation algorithm implemented for use on this robot, which was inspired by the referring phenomena found in the Joint Communication Task corpus of human-human joint construction dialogues. The context-sensitive algorithm was evaluated in two user studies comparing it to a baseline algorithm, using a combination of objective performance measures and subjective user-satisfaction scores. In both studies, objective task performance and dialogue quality were found to be the same for both versions of the system; however, in both cases the context-sensitive system scored more highly on subjective measures of interaction quality.

The generation of referring expressions (GRE) is one of the most clearly defined sub-tasks in natural language generation (NLG), and is therefore among the tasks that have received the most attention.
The classic GRE task involves creating an initial, stand-alone description intended solely to distinguish the target from any "distractors" in the area, and the dominant method of evaluating such systems involves measuring the similarity of the generated references to those drawn from a suitable corpus of human-generated descriptions.

In this paper, we consider referring expressions in the context of joint action in a shared workspace, where the dialogue takes place between a human and a humanoid robot. The robot was designed to co-operate with a human partner on a joint construction task, so the referring expressions it generates take account of the discourse and physical context in which it operates, drawing inspiration from the referring phenomena found in a corpus of human-human dialogues in a similar joint construction scenario.

In this interactive, embodied context, the most important measures of success are extrinsic measures such as task success and users' subjective opinions, and it is known that the predictions of typical intrinsic, corpus-based evaluation strategies do not tend to correlate w...
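To make the classic task concrete, the following is a minimal sketch of the widely used incremental approach to generating a distinguishing description (in the spirit of Dale and Reiter's algorithm). It is an illustration only, not the context-sensitive algorithm described in this paper; the attribute names, preference order, and object representations are assumptions chosen for the example.

```python
def incremental_description(target, distractors, preferred_attributes):
    """Select attribute-value pairs from the target, in a fixed preference
    order, until the remaining distractor set is empty (a sketch of the
    classic incremental GRE algorithm)."""
    description = {}
    remaining = list(distractors)
    for attr in preferred_attributes:
        value = target.get(attr)
        if value is None:
            continue
        # Keep this attribute only if it rules out at least one distractor.
        if any(d.get(attr) != value for d in remaining):
            description[attr] = value
            remaining = [d for d in remaining if d.get(attr) == value]
        if not remaining:  # target is now uniquely identified
            break
    return description

# Hypothetical scene: a small red cube among two distractor objects.
target = {"type": "cube", "colour": "red", "size": "small"}
distractors = [
    {"type": "cube", "colour": "blue", "size": "small"},
    {"type": "ball", "colour": "red", "size": "large"},
]
print(incremental_description(target, distractors, ["colour", "type", "size"]))
# → {'colour': 'red', 'type': 'cube'}  ("the red cube")
```

Note that the algorithm operates only on the static scene: it has no notion of discourse history, task state, or the partner's attention, which is precisely the gap the context-sensitive approach described here is intended to address.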