How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

Liu, Chia-Wei; Lowe, Ryan; Serban, Iulian Vlad; Noseworthy, Michael D.; Charlin, Laurent; Pineau, Joëlle

doi:10.18653/v1/d16-1230

Cited by 875 publications

(878 citation statements)

References 32 publications

Mentioning: 856


“…We also conduct an analysis of the response data from (Liu et al, 2016), where the pre-processing is standardized by removing '<first speaker>' tokens at the beginning of each utterance. The results are detailed in the supplemental material.…”

Section: Results (mentioning)

confidence: 99%

“…The most widely used metric for evaluating such dialogue systems is BLEU (Papineni et al, 2002), a metric measuring word overlaps originally developed for machine translation. However, it has been shown that BLEU and other word-overlap metrics are biased and correlate poorly with human judgements of response quality (Liu et al, 2016). There are many obvious cases where these metrics fail, as they are often incapable of considering the semantic similarity between responses (see Figure 1).…”

Section: Context Of Conversation (mentioning)

confidence: 99%
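The failure mode described in the statement above can be illustrated with a minimal sketch. It assumes a simplified stand-in for BLEU: clipped unigram precision only, with no higher-order n-grams and no brevity penalty. The function name and the example sentences are hypothetical and do not come from any of the cited papers.

```python
from collections import Counter


def clipped_unigram_precision(candidate: str, reference: str) -> float:
    """Fraction of candidate tokens that also occur in the reference (counts clipped)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate token is credited at most as often as it occurs in the reference.
    matched = sum(min(count, ref[token]) for token, count in cand.items())
    return matched / total


# Hypothetical example sentences, chosen only to illustrate the point.
reference = "i am not sure how to fix that"
paraphrase = "no idea , sorry"            # plausible reply that shares no words
near_copy = "i am sure how to fix that"   # shares every word but drops the negation

paraphrase_score = clipped_unigram_precision(paraphrase, reference)
near_copy_score = clipped_unigram_precision(near_copy, reference)
print(paraphrase_score, near_copy_score)
```

Under this simplified metric the reasonable paraphrase receives the lowest possible score, while the reply that reverses the meaning receives the highest, which is the kind of mismatch with semantic similarity that the quoted statement refers to.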

“…To achieve this variety, we use candidate responses from several different models. Following (Liu et al, 2016), we use the following 4 sources of candidate responses: (1) a response selected by a TF-IDF retrieval-based model, (2) a response selected by the Dual Encoder (DE) (Lowe et al, 2015), (3) a response generated using the hierarchical recurrent encoder-decoder (HRED) model (Serban et al, 2016a), and (4) human-generated responses. It should be noted that the human-generated candidate responses are not the reference responses from a fixed corpus, but novel human responses that are different from the reference.…”

Section: Data Collection (mentioning)

confidence: 99%


Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses

Lowe, Noseworthy, Serban et al. 2017

Proceedings of the 55th Annual Meeting of the Association For Computational Linguistics (Volume 1: Long Papers)

Self Cite


Automatically evaluating the quality of dialogue responses for unstructured domains is a challenging problem. Unfortunately, existing automatic evaluation metrics are biased and correlate very poorly with human judgements of response quality. Yet having an accurate automatic evaluation procedure is crucial for dialogue research, as it allows rapid prototyping and testing of new models with fewer expensive human evaluations. In response to this challenge, we formulate automatic dialogue evaluation as a learning problem. We present an evaluation model (ADEM) that learns to predict human-like scores to input responses, using a new dataset of human response scores. We show that the ADEM model's predictions correlate significantly, and at a level much higher than word-overlap metrics such as BLEU, with human judgements at both the utterance and system-level. We also show that ADEM can generalize to evaluating dialogue models unseen during training, an important step for automatic dialogue evaluation.



“…Our goal is to take advantage of this feature not only for evaluation, but also for the system's actual design. As far as the evaluation of unsupervised response generation systems goes, this is a challenging area of research in its own right [19,18].…”

Section: Related Work (mentioning)

confidence: 99%

Boosting a Rule-Based Chatbot Using Statistics and User Satisfaction Ratings

Efraim, Maraev, Rodrigues 2017

Communications in Computer and Information Science


Abstract. Using data from user-chatbot conversations where users have rated the answers as good or bad, we propose a more efficient alternative to a chatbot's keyword-based answer retrieval heuristic. We test two neural network approaches to the near-duplicate question detection task as a first step towards a better answer retrieval method. A convolutional neural network architecture gives promising results on this difficult task.


“…This is also the case in open-domain dialogue systems, in which common evaluation metrics like BLEU (Papineni et al, 2002) are only weakly correlated with human judgments (Liu et al, 2016). Another problem with metrics like BLEU is the dependence on a gold standard.…”

Section: Introduction (mentioning)

confidence: 99%

Quality Signals in Generated Stories

Sagarkar, Wieting et al. 2018

Proceedings of the Seventh Joint Conference on Lexical And Computational Semantics


We study the problem of measuring the quality of automatically-generated stories. We focus on the setting in which a few sentences of a story are provided and the task is to generate the next sentence ("continuation") in the story. We seek to identify what makes a story continuation interesting, relevant, and have high overall quality. We crowdsource annotations along these three criteria for the outputs of story continuation systems, design features, and train models to predict the annotations. Our trained scorer can be used as a rich feature function for story generation, a reward function for systems that use reinforcement learning to learn to generate stories, and as a partial evaluation metric for story generation.

