How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

Liu, Chia-Wei; Lowe, Ryan; Serban, Iulian Vlad; Noseworthy, Michael D.; Charlin, Laurent; Pineau, Joëlle

doi:10.48550/arxiv.1603.08023

Cited by 430 publications

(193 citation statements)

References 21 publications

Supporting

Mentioning: 185

Contrasting

“…Automatic metrics: Automatic metrics are the most convenient for fast, efficient and reproducible research with a quick turn-around and development cycle, hence they are frequently used. Unfortunately, many of them, such as BLEU, METEOR and ROUGE have been shown to only "correlate very weakly with human judgement" (Liu et al, 2016). A central problem is that due to the open-ended nature of conversations, there are many possible responses in a given dialogue, and, while having multiple references can help, there is typically only one gold label available (Gupta et al, 2019).…”

Section: Existing Work

mentioning

confidence: 99%

“…Any comprehensive analysis of the performance of an open-domain conversational model must include human evaluations: automatic metrics can capture certain aspects of model performance but are no replacement for having human raters judge how adept models are at realistic and interesting conversation (Deriu et al, 2021; Liu et al, 2016; Dinan et al, 2019b). Unfortunately, human evaluations themselves must be carefully constructed in order to capture all the aspects desired of a good conversationalist.…”

Section: Introduction

mentioning

confidence: 99%

“…At the heart of improving conversational AI is the open problem of how to evaluate conversations. Issues with automatic metrics are well known (Liu et al, 2016), with human evaluations still considered the gold standard. Unfortunately, how to perform human evaluations is also an open problem: differing data collection methods have varying levels of human agreement and statistical sensitivity, resulting in differing amounts of human annotation hours and labor costs.…”

mentioning

confidence: 99%

Human Evaluation of Conversations is an Open Problem: comparing the sensitivity of various methods for evaluating dialogue agents

Smith, Hsu, Qian et al. 2022

Preprint

At the heart of improving conversational AI is the open problem of how to evaluate conversations. Issues with automatic metrics are well known (Liu et al., 2016), with human evaluations still considered the gold standard. Unfortunately, how to perform human evaluations is also an open problem: differing data collection methods have varying levels of human agreement and statistical sensitivity, resulting in differing amounts of human annotation hours and labor costs. In this work we compare five different crowdworker-based human evaluation methods and find that different methods are best depending on the types of models compared, with no clear winner across the board. While this highlights the open problems in the area, our analysis leads to advice of when to use which one, and possible future directions.

“…Neither the linguistic quality nor the pedagogical quality of a question can be measured by automatic means. Although metrics such as BLEU or ROUGE are often used to estimate the linguistic quality of generated texts, they only infrequently correlate with actual human judgements [44]. Hence, the investigation of the given research question requires an empirical evaluation study.…” [Footnote 2 of the citing paper: the annotated data is available at https://github.com/tsteu/deft aqg/tree/master]

Section: A Research Question

mentioning

confidence: 99%

“…Hence, the reported results are harder to interpret in the context of other studies investigating automatic question generation. However, it has been argued that most automatic metrics such as BLEU [24], which have been used to compare such systems, are ill-suited for the task [44], [50] due to their low correlation with actual human judges. Hence, a direct comparison of AQG systems without human evaluation has little value.…”

Section: B Limitations Of the Evaluation Study

mentioning

confidence: 99%

I Do Not Understand What I Cannot Define: Automatic Question Generation With Pedagogically-Driven Content Selection

Steuer, Filighera, Meuser et al. 2021

Preprint

Most learners fail to develop deep text comprehension when reading textbooks passively. Posing questions about what learners have read is a well-established way of fostering their text comprehension. However, many textbooks lack self-assessment questions because authoring them is time-consuming and expensive. Automatic question generators may alleviate this scarcity by generating sound pedagogical questions. However, generating questions automatically poses linguistic and pedagogical challenges. What should we ask? And, how do we phrase the question automatically? We address those challenges with an automatic question generator grounded in learning theory. The paper introduces a novel pedagogically meaningful content selection mechanism to find question-worthy sentences and answers in arbitrary textbook contents. We conducted an empirical evaluation study with educational experts, annotating 150 generated questions in six different domains. Results indicate a high linguistic quality of the generated questions. Furthermore, the evaluation results imply that the majority of the generated questions inquire about central information related to the given text and may foster text comprehension in specific learning scenarios.

TextKD-GAN: Text Generation Using Knowledge Distillation and Generative Adversarial Networks

Haidar, Rezagholizadeh 2019

Lecture Notes in Computer Science

Text generation is of particular interest in many NLP applications such as machine translation, language modeling, and text summarization. Generative adversarial networks (GANs) achieved remarkable success in high-quality image generation in computer vision, and recently, GANs have gained lots of interest from the NLP community as well. However, achieving similar success in NLP would be more challenging due to the discrete nature of text. In this work, we introduce a method using knowledge distillation to effectively exploit the GAN setup for text generation. We demonstrate how autoencoders (AEs) can be used for providing a continuous representation of sentences, which is a smooth representation that assigns non-zero probabilities to more than one word. We distill this representation to train the generator to synthesize similar smooth representations. We perform a number of experiments to validate our idea using different datasets and show that our proposed approach yields better performance in terms of the BLEU score and Jensen-Shannon distance (JSD) measure compared to traditional GAN-based text generation approaches without pre-training.
