Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/d16-1230

How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

Abstract: We investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available. Recent works in response generation have adopted metrics from machine translation to compare a model's generated response to a single target response. We show that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain. We provide quantitative and qualitative results highlighting spec…
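The abstract's central finding is that word-overlap metrics borrowed from machine translation correlate weakly, or not at all, with human judgements of response quality. As a minimal sketch of how such a correlation is typically measured, assuming parallel lists of per-response metric scores and human ratings (the values and variable names below are invented for illustration and are not taken from the paper's released code):

```python
# Sketch: correlating an automatic metric with human judgements of responses.
# Assumes one metric score and one (mean) human rating per generated response;
# the numbers here are hypothetical placeholders.
from scipy.stats import pearsonr, spearmanr

metric_scores = [0.12, 0.03, 0.40, 0.08, 0.25]   # e.g. per-response BLEU (hypothetical)
human_ratings = [3.0, 1.5, 4.0, 4.5, 2.0]        # e.g. human quality ratings (hypothetical)

pearson_r, pearson_p = pearsonr(metric_scores, human_ratings)
spearman_rho, spearman_p = spearmanr(metric_scores, human_ratings)
print(f"Pearson r = {pearson_r:.3f} (p = {pearson_p:.3f})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3f})")
```

A weak or near-zero correlation over a large sample of responses is the kind of evidence the paper reports for the Twitter and Ubuntu domains.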

Cited by 875 publications (878 citation statements); references 32 publications.

Citation statements:
“…We also conduct an analysis of the response data from (Liu et al, 2016), where the pre-processing is standardized by removing '<first speaker>' tokens at the beginning of each utterance. The results are detailed in the supplemental material.…”
Section: Results (mentioning; confidence: 99%)
“…The most widely used metric for evaluating such dialogue systems is BLEU (Papineni et al, 2002), a metric measuring word overlaps originally developed for machine translation. However, it has been shown that BLEU and other word-overlap metrics are biased and correlate poorly with human judgements of response quality (Liu et al, 2016). There are many obvious cases where these metrics fail, as they are often incapable of considering the semantic similarity between responses (see Figure 1).…”
Section: Context of Conversation (mentioning; confidence: 99%)
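The statement above points out that BLEU and other word-overlap metrics cannot credit a response that is semantically adequate but lexically different from the single reference. A minimal sketch of this failure mode, using NLTK's sentence-level BLEU (the example utterances are invented for illustration and are not drawn from the paper's data):

```python
# Sketch: a paraphrased, adequate response receives a near-zero BLEU score
# because it shares almost no n-grams with the single reference response.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "try reinstalling the driver and reboot".split()
candidate = "you could remove and install it again then restart".split()

smooth = SmoothingFunction().method1  # avoid zero scores when higher-order n-grams are absent
score = sentence_bleu([reference], candidate, smoothing_function=smooth)
print(f"BLEU = {score:.4f}")  # close to zero despite the similar meaning
```

Because only a single reference response is available per context, this kind of mismatch is common in dialogue, which is one reason the paper finds such weak correlation with human judgements.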
“…Our goal is to take advantage of this feature not only for evaluation, but also for the system's actual design. As far as the evaluation of unsupervised response generation systems goes, this is a challenging area of research in its own right [19,18].…”
Section: Related Work (mentioning; confidence: 99%)
“…This is also the case in open-domain dialogue systems, in which common evaluation metrics like BLEU (Papineni et al, 2002) are only weakly correlated with human judgments (Liu et al, 2016). Another problem with metrics like BLEU is the dependence on a gold standard.…”
Section: Introduction (mentioning; confidence: 99%)