2016
DOI: 10.48550/arxiv.1603.08023
Preprint
How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

Cited by 430 publications (193 citation statements)
References 21 publications
“…Automatic metrics: Automatic metrics are the most convenient for fast, efficient and reproducible research with a quick turn-around and development cycle, hence they are frequently used. Unfortunately, many of them, such as BLEU, METEOR and ROUGE, have been shown to only "correlate very weakly with human judgement" (Liu et al., 2016). A central problem is that, due to the open-ended nature of conversations, there are many possible responses in a given dialogue, and, while having multiple references can help, there is typically only one gold label available (Gupta et al., 2019).…”
Section: Existing Work
confidence: 99%
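To make the single-reference problem in the excerpt above concrete, here is a minimal sketch (not from any of the cited papers) using NLTK's sentence-level BLEU. The dialogue context, reference, and candidate responses are invented for illustration; the point is only that a word-overlap metric scored against one gold reference penalizes an equally acceptable response.

# Minimal sketch: why single-reference word-overlap metrics such as BLEU
# can rate a perfectly reasonable dialogue response poorly.
# Requires `pip install nltk`.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical context: "Do you want to grab dinner tonight?"
reference = ["sure", ",", "what", "time", "works", "for", "you", "?"]

# A lexically similar response, and a semantically valid but lexically
# different one; both are acceptable in an open-ended conversation.
candidate_close = ["sure", ",", "what", "time", "suits", "you", "?"]
candidate_valid = ["sounds", "great", ",", "i", "am", "free", "after", "seven", "."]

smooth = SmoothingFunction().method1  # avoid zero scores on short sentences
print(sentence_bleu([reference], candidate_close, smoothing_function=smooth))
print(sentence_bleu([reference], candidate_valid, smoothing_function=smooth))
# The second, equally acceptable response gets a near-zero BLEU score: the
# single-gold-label problem the quoted studies describe.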
“…Any comprehensive analysis of the performance of an open-domain conversational model must include human evaluations: automatic metrics can capture certain aspects of model performance but are no replacement for having human raters judge how adept models are at realistic and interesting conversation (Deriu et al., 2021; Liu et al., 2016; Dinan et al., 2019b). Unfortunately, human evaluations themselves must be carefully constructed in order to capture all the aspects desired of a good conversationalist.…”
Section: Introduction
confidence: 99%
“…Neither the linguistic quality nor the pedagogical quality of a question can be measured by automatic means. Although metrics such as BLEU or ROUGE are often used to estimate the linguistic quality of generated texts, they only infrequently correlate with actual human judgements [44]. Hence, the investigation of the given research question requires an empirical evaluation study.…” (Footnote in the original: the annotated data is available at https://github.com/tsteu/deft aqg/tree/master)
Section: A Research Question
confidence: 99%
“…Hence, the reported results are harder to interpret in the context of other studies investigating automatic question generation. However, it has been argued that most automatic metrics, such as BLEU [24], which have been used to compare such systems, are ill-suited for the task [44], [50] due to their low correlation with actual human judges. Hence, a direct comparison of AQG systems without human evaluation has little value.…”
Section: B Limitations Of the Evaluation Study
confidence: 99%