2021
DOI: 10.1109/taslp.2021.3074012

D-Score: Holistic Dialogue Evaluation Without Reference

Abstract: In artistic gymnastics, difficulty score or D-score is used for judging performance. Starting from zero, an athlete earns points from different aspects such as composition requirement, difficulty, and connection between moves. The final score is a composition of the quality of various performance indicators. Similarly, when evaluating dialogue responses, human judges generally follow a number of criteria, among which language fluency, context coherence, logical consistency, and semantic appropriateness are on …

Cited by 14 publications (23 citation statements)
References 44 publications
“…1 https://github.com/e0397123/DynaEval Commonly used static metrics, such as BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014) and ROUGE (Lin, 2004), correlate poorly with human judgements (Liu et al., 2016), rendering them unsuitable for dialogue evaluation. While some recent automatic dialogue evaluation metrics (Ghazarian et al., 2019; Mehri and Eskenazi, 2020b; Zhang et al., 2021b) demonstrate strong correlations with human judgement at the turn level, they only focus on context-response pairs without explicitly modeling the interaction over an entire dialogue. To perform dialogue-level evaluation, we need to rely on the aggregation of turn-level scores over the dialogue as a proxy for a dialogue-level score.…”
Section: Introduction
confidence: 99%
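
The last sentence of the statement above describes scoring a whole dialogue by aggregating turn-level scores. The sketch below only illustrates that proxy; the mean aggregation and the `turn_level_score` callable are hypothetical placeholders, not DynaEval's or D-score's actual implementation.

```python
from statistics import mean
from typing import Callable, List


def dialogue_level_score(
    turns: List[str],
    turn_level_score: Callable[[List[str], str], float],
) -> float:
    """Proxy for a dialogue-level score: average of per-turn scores.

    Each turn after the first is treated as a response to the turns that
    precede it (its context), scored, and the scores are averaged.
    """
    scores = [
        turn_level_score(turns[:i], turns[i])  # (context, response) pair
        for i in range(1, len(turns))
    ]
    return mean(scores) if scores else 0.0
```

Mean aggregation is the simplest possible choice; the quoted criticism is precisely that such a proxy ignores interaction across the whole dialogue, which dialogue-level metrics try to model explicitly.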
“…Two possible types of inconsistency occur in open-domain dialogue generation: (1) inconsistency among the system utterances, such as when the system contradicts its previous utterance; (2) inconsistency with some external source, such as factually incorrect utterances. Whereas the first type is described using the term "consistency" [100,186,208] or "coherence" [11,39], the second type has recently come to be called "hallucination" [122,152]. Self-inconsistency can be considered an intrinsic hallucination problem, while external inconsistency involves both intrinsic and extrinsic hallucinations, depending on the reference source.…”
Section: Open-domain Dialogue Generation
confidence: 99%
“…Model-based Metric. Recently, several works have proposed evaluation metrics for measuring consistency, such as using natural language inference (NLI) [39,186], training learnable evaluation metrics [208], or releasing an additional test set for coherence [11]. For the KGD task, Dziri et al. [41] propose the BEGIN benchmark, which consists of samples taken from Dinan et al. [31] with additional human annotation and a new classification task extending the Natural Language Inference (NLI) paradigm.…”
Section: Hallucination Metrics For Generation-based Dialogue Systems …
confidence: 99%
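
As a rough illustration of the NLI-based consistency idea mentioned above (not the metric of any cited work), the sketch below scores the self-consistency of a system's utterances with an off-the-shelf MNLI model. The model name, the label indexing, and the "1 minus worst pairwise contradiction" scoring rule are all assumptions made for illustration.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Off-the-shelf NLI model (assumed choice for illustration).
MODEL_NAME = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()


def contradiction_probability(premise: str, hypothesis: str) -> float:
    """Probability that `hypothesis` contradicts `premise` under the NLI model."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
    # roberta-large-mnli label order: 0 = contradiction, 1 = neutral, 2 = entailment
    return probs[0].item()


def self_consistency_score(system_utterances: list) -> float:
    """One minus the worst pairwise contradiction among the system's utterances."""
    worst = 0.0
    for i in range(len(system_utterances)):
        for j in range(i + 1, len(system_utterances)):
            worst = max(
                worst,
                contradiction_probability(system_utterances[i], system_utterances[j]),
            )
    return 1.0 - worst
```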
“…HolisticEval (Pang et al., 2020) adopts different models for evaluating several qualities of dialog: context coherence, language fluency, response diversity, and logical self-consistency. D-score (Zhang et al., 2021d) adopts a single multitask model for evaluating various dialog qualities, including context coherence, language fluency, logical self-consistency, and semantic appropriateness. Deep AM-FM (Zhang et al., 2021c) measures both semantic similarity and response fluency, and the PARADISE-style model of Walker et al. (2021) uses both predicted user ratings and dialog length.…”
Section: Automatic Evaluation Metrics For Dialog
confidence: 99%
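
D-score, as characterized in the statement above, combines several dialog qualities into one holistic judgement. The snippet below is only a schematic of that composition step: the equal weights and the fixed criterion names are hypothetical placeholders, not the paper's learned multitask model.

```python
from typing import Dict

# Hypothetical, equal weights chosen purely for illustration.
CRITERION_WEIGHTS: Dict[str, float] = {
    "context_coherence": 0.25,
    "language_fluency": 0.25,
    "logical_self_consistency": 0.25,
    "semantic_appropriateness": 0.25,
}


def holistic_score(sub_scores: Dict[str, float]) -> float:
    """Weighted sum of per-criterion scores, each assumed to lie in [0, 1]."""
    return sum(weight * sub_scores[name] for name, weight in CRITERION_WEIGHTS.items())


print(holistic_score({
    "context_coherence": 0.8,
    "language_fluency": 0.9,
    "logical_self_consistency": 0.7,
    "semantic_appropriateness": 0.85,
}))  # 0.8125
```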