Proceedings of the 17th Annual Meeting of the Special Interest Group on Discourse and Dialogue 2016
DOI: 10.18653/v1/w16-3634

On the Evaluation of Dialogue Systems with Next Utterance Classification

Abstract: An open challenge in constructing dialogue systems is developing methods for automatically learning dialogue strategies from large amounts of unlabelled data. Recent work has proposed Next-Utterance-Classification (NUC) as a surrogate task for building dialogue systems from text data. In this paper we investigate the performance of humans on this task to validate the relevance of NUC as a method of evaluation. Our results show three main findings: (1) humans are able to correctly classify responses at a rate mu…
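For readers unfamiliar with the task, the following is a minimal sketch of how NUC-style evaluation is typically scored: a model (or a human judge) is shown a dialogue context and a small candidate set containing the true next utterance plus distractors, and Recall@k measures how often the true response is ranked in the top k. The function names, data layout, and the random scorer below are illustrative assumptions, not code from the paper.

```python
import random


def recall_at_k(score_fn, examples, k=1):
    """Recall@k for NUC-style evaluation.

    Each example is (context, true_response, distractors). `score_fn(context,
    response)` is any response scorer (a placeholder here); the metric is the
    fraction of examples whose true response is ranked within the top k of
    the candidate set.
    """
    hits = 0
    for context, true_response, distractors in examples:
        candidates = [true_response] + list(distractors)
        ranked = sorted(candidates, key=lambda r: score_fn(context, r), reverse=True)
        if true_response in ranked[:k]:
            hits += 1
    return hits / len(examples)


# Toy usage with a random scorer standing in for a real dialogue model.
examples = [
    ("how do I reset my password?",                    # context
     "click 'forgot password' on the login page",      # true next utterance
     ["the weather is nice today", "I like trains"]),  # distractors
]
print(recall_at_k(lambda c, r: random.random(), examples, k=1))
```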

Cited by 39 publications (38 citation statements) · References 19 publications
“…Though this task was not originally part of the MultiWoz dataset, we construct the necessary data for this task by randomly sampling negative examples. This task is underlined by Lowe et al (2016)'s suggestion that using NUR for evaluation is extremely indicative of performance and is one of the best forms of evaluation. Hits@1 (H@1) is used to evaluate our retrieval models.…”
Section: Next-utterance Retrieval (mentioning)
confidence: 99%
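The setup quoted above (constructing retrieval data by randomly sampling negative responses, then scoring with Hits@1) can be illustrated with a short sketch. The function names, the default of four negatives, and the (context, response) data layout are assumptions made for illustration; they are not taken from the MultiWOZ work or from Lowe et al. (2016).

```python
import random


def build_nur_examples(dialogues, n_negatives=4, seed=0):
    """Build next-utterance-retrieval examples from (context, response) pairs.

    Each context is paired with its true response plus `n_negatives` responses
    sampled at random from other dialogues (assumes the corpus contains more
    than one distinct response).
    """
    rng = random.Random(seed)
    all_responses = [resp for _, resp in dialogues]
    examples = []
    for context, response in dialogues:
        negatives = []
        while len(negatives) < n_negatives:
            cand = rng.choice(all_responses)
            if cand != response:
                negatives.append(cand)
        examples.append((context, response, negatives))
    return examples


def hits_at_1(score_fn, examples):
    """Hits@1: fraction of examples where the true response is the single
    top-scoring candidate under `score_fn(context, response)`."""
    hits = 0
    for context, response, negatives in examples:
        candidates = [response] + negatives
        best = max(candidates, key=lambda r: score_fn(context, r))
        hits += int(best == response)
    return hits / len(examples)
```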
“…The latter has users interact with a dialogue system by giving them a goal and asking them to evaluate the dialogue after its completion. It is unclear how to extend these methods to an open-ended domain as AMT workers are unlikely to have enough expertise to evaluate success or start a conversation on each possible dialogue topic (Lowe et al, 2016). Furthermore, although AMT is both faster and cheaper than running in-person experiments, we search for an automatic evaluation method with near zero costs.…”
Section: Related Work (mentioning)
confidence: 99%
“…The role of human judgements in such settings is nonetheless purely evaluative: the judge assesses post hoc the quality of a small sample of the system output according to some relevancy criterion. In contrast to these experiments, ours is not an unsupervised response generation system, but a supervised retrieval-based system, as defined in [19], insofar as it does "explicitly incorporate some supervised signal such as task completion or user satisfaction". Our goal is to take advantage of this feature not only for evaluation, but also for the system's actual design.…”
Section: Related Work (mentioning)
confidence: 99%
“…Our goal is to take advantage of this feature not only for evaluation, but also for the system's actual design. As far as the evaluation of unsupervised response generation systems goes, this is a challenging area of research in its own right [19,18].…”
Section: Related Work (mentioning)
confidence: 99%