Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019)
DOI: 10.18653/v1/n19-1381

Evaluating Coherence in Dialogue Systems using Entailment

Abstract: Evaluating open-domain dialogue systems is difficult due to the diversity of possible correct answers. Automatic metrics such as BLEU correlate weakly with human annotations, resulting in a significant bias across different models and datasets. Some researchers resort to human judgment experimentation for assessing response quality, which is expensive, time consuming, and not scalable. Moreover, judges tend to evaluate a small number of dialogues, meaning that minor differences in evaluation configuration may …

Cited by 59 publications (54 citation statements)
References 28 publications (39 reference statements)

Citation statements:
“…We test the KvBERT on two tasks: (1) Reranking the top 20 responses from a retrieval model, to see whether the profile consistency is improved (Welleck et al., 2019). (2) Given the responses from state-of-the-art generative dialogue models, to see how well the KvBERT's consistency prediction agrees with the human annotation (Dziri et al., 2019).…”
Section: Testing on Downstream Tasks
confidence: 99%
“…Experimental results show that KvBERT obtains significant improvements over strong baselines. We further test the KvBERT model on two downstream tasks, including a reranking task (Welleck et al., 2019) and a consistency prediction task (Dziri et al., 2019). Evaluation results show that (1) the KvBERT reranking improves response consistency, and (2) the KvBERT consistency prediction has a good agreement with human annotation.…”
Section: Introduction
confidence: 99%
“…Evaluating and interpreting open-domain dialog models is notoriously challenging. Multiple studies have shown that standard evaluation metrics such as perplexity and BLEU scores (Papineni et al., 2002) correlate very weakly with human judgements of conversation quality (Dziri et al., 2019). This has inspired multiple new approaches for evaluating dialog systems.…”
Section: Related Work
confidence: 99%
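The weak-correlation point in the statement above can be checked empirically. The sketch below is not taken from any of the cited papers; the toy responses and ratings are invented. It computes sentence-level BLEU for a handful of rated dialogue responses with NLTK and reports the rank correlation with the human scores via SciPy.

```python
# Minimal sketch: how weakly does BLEU track human ratings of responses?
# The (reference, hypothesis, human rating) triples below are hypothetical.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from scipy.stats import spearmanr

smooth = SmoothingFunction().method1  # smoothing is needed for short responses

def bleu(reference: str, hypothesis: str) -> float:
    """Sentence-level BLEU between one reference and one hypothesis."""
    return sentence_bleu([reference.split()], hypothesis.split(),
                         smoothing_function=smooth)

rated = [
    ("i love hiking on weekends", "me too , hiking is great", 4),
    ("i love hiking on weekends", "the weather is nice today", 2),
    ("what music do you like ?", "i mostly listen to jazz", 5),
    ("what music do you like ?", "i do not know", 3),
]

bleu_scores = [bleu(ref, hyp) for ref, hyp, _ in rated]
human_scores = [rating for _, _, rating in rated]

# Rank correlation between the automatic metric and the human judgements;
# a value near zero is the failure mode the citing papers describe.
rho, p_value = spearmanr(bleu_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f})")
```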
“…This has inspired multiple new approaches for evaluating dialog systems. One popular evaluation metric involves calculating the semantic similarity between the user input and generated response in high-dimensional embedding space (Dziri et al., 2019; Park et al., 2018; Zhao et al., 2017). Other work proposed calculating conversation metrics such as sentiment and coherence on self-play conversations generated by trained models.…”
Section: Related Work
confidence: 99%
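For the embedding-similarity metric mentioned in the statement above, a minimal sketch follows. It assumes the sentence-transformers package and the 'all-MiniLM-L6-v2' checkpoint as a stand-in encoder; the cited works use their own embedding models, so this is illustrative only.

```python
# Embedding-space coherence sketch: cosine similarity between the dialogue
# context and the generated response. The encoder choice is an assumption.
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def embedding_coherence(context: str, response: str) -> float:
    """Cosine similarity between context and response embeddings."""
    ctx_vec, resp_vec = encoder.encode([context, response])
    return float(np.dot(ctx_vec, resp_vec)
                 / (np.linalg.norm(ctx_vec) * np.linalg.norm(resp_vec)))

# A topically related response should score higher than an off-topic one.
context = "do you have any plans for the holidays ?"
print(embedding_coherence(context, "yes , i am visiting my parents for christmas"))
print(embedding_coherence(context, "the stock market dropped sharply today"))
```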
“…For this initial study, we focus on two metrics, readability and coherence. These metrics are among those essential to evaluate the quality of generated responses (Novikova et al., 2017; Dziri et al., 2019). We describe an automated method to compute each metric.…”
Section: Metrics
confidence: 99%
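As a rough illustration of the readability side of the statement above (not the cited authors' actual procedure), the snippet below scores generated responses with the Flesch reading-ease formula via the textstat package; coherence could be scored with an embedding-similarity measure like the sketch shown earlier. The sample responses are hypothetical.

```python
# Hypothetical readability check for generated responses using textstat.
import textstat

responses = [
    "i like to read books about history and science .",
    "notwithstanding the aforementioned considerations , a multiplicity of "
    "factors impinge upon the interlocutor's perception of the utterance .",
]

for response in responses:
    # Higher Flesch reading-ease means the text is easier to read.
    score = textstat.flesch_reading_ease(response)
    print(f"{score:6.1f}  {response}")
```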