2021
DOI: 10.48550/arxiv.2106.00507
Preprint

Towards Quantifiable Dialogue Coherence Evaluation

Abstract: Automatic dialogue coherence evaluation has attracted increasing attention and is crucial for developing promising dialogue systems. However, existing metrics have two major limitations: (a) they are mostly trained in a simplified two-level setting (coherent vs. incoherent), while humans give Likert-type multi-level coherence scores, dubbed as "quantifiable"; (b) their predicted coherence scores cannot align with the actual human rating standards due to the absence of human guidance during training. To address…

Cited by 1 publication (2 citation statements)
References 22 publications

“…Baseline We compare our evaluation metrics with eleven popular automatic dialogue evaluation metrics, including three lexical word-overlap metrics: BLEU, ROUGE, and METEOR (Banerjee and Lavie 2005); five metrics that consider semantic representation: BERTScore, ADEM (Lowe et al. 2017), BERT-RUBER, BLEURT, and QuantiDCE (Ye et al. 2021); and three metrics that take additional information about the dialogue into account: DynaEval, GRADE, and ChatGPT. Evaluation The common practice to show the effectiveness of a dialogue evaluation metric is to calculate the correlation between the model-predicted and the human-rated scores (Zhang et al. 2021; Huang et al. 2020).…”
Section: Experiments, Experimental Setup
confidence: 99%
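
The evaluation practice described in the citation above, correlating metric-predicted scores with human ratings, can be illustrated with a minimal sketch. The scores below are hypothetical placeholders rather than data from the cited papers; Pearson and Spearman coefficients are the statistics most commonly reported for this comparison.

```python
# Minimal sketch (assumed setup, not the cited papers' code): correlate
# metric-predicted coherence scores with human Likert-style ratings.
from scipy.stats import pearsonr, spearmanr

# Hypothetical scores for five context-response pairs (illustrative placeholders).
metric_scores = [0.72, 0.35, 0.88, 0.41, 0.63]  # predicted by an automatic metric
human_scores = [4.0, 2.0, 5.0, 2.5, 3.5]        # averaged human Likert ratings

pearson_r, _ = pearsonr(metric_scores, human_scores)
spearman_rho, _ = spearmanr(metric_scores, human_scores)

print(f"Pearson r:    {pearson_r:.3f}")
print(f"Spearman rho: {spearman_rho:.3f}")
```

Spearman's rank correlation is typically reported alongside Pearson's because Likert-style human ratings are ordinal rather than strictly interval-scaled.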
“…Dialogue coherence evaluation, which refers to the coherence and consistency of a dialogue's content and structure, is essential for research on open dialogue systems (See et al. 2019; Ye et al. 2021). Dialogues exhibit higher coherence when the responses are linguistically fluent, clear in meaning, context-sensitive, and logically tight.…”
Section: Introduction
confidence: 99%