Evaluating Coherence in Dialogue Systems using Entailment

Dziri, Nouha; Kamalloo, Ehsan; Mathewson, Kory W.; Zaïane, Osmar R.

doi:10.18653/v1/n19-1381

Cited by 59 publications

(54 citation statements)

References 28 publications

(39 reference statements)

“…We test the KvBERT on two tasks: (1) Reranking the top 20 responses from a retrieval model, to see whether the profile consistency is improved (Welleck et al, 2019). (2) Given the responses from state-of-the-art generative dialogue models, to see how well the KvBERT's consistency prediction agrees with the human annotation (Dziri et al, 2019).…”

Section: Testing On Downstream Tasks (mentioning)

confidence: 99%

“…Experimental results show that KvBERT obtains significant improvements over strong baselines. We further test the KvBERT model on two downstream tasks, including a reranking task (Welleck et al, 2019) and a consistency prediction task (Dziri et al, 2019). Evaluation results show that (1) the KvBERT reranking improves response consistency, and (2) the KvBERT consistency prediction has a good agreement with human annotation.…”

Section: Introduction (mentioning)

confidence: 99%

Profile Consistency Identification for Open-domain Dialogue Agents

Song, Wang, Zhang et al. 2020

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

Maintaining a consistent attribute profile is crucial for dialogue agents to naturally converse with humans. Existing studies on improving attribute consistency mainly explored how to incorporate attribute information in the responses, but few efforts have been made to identify the consistency relations between response and attribute profile. To facilitate the study of profile consistency identification, we create a large-scale human-annotated dataset with over 110K single-turn conversations and their key-value attribute profiles. Explicit relation between response and profile is manually labeled. We also propose a key-value structure information enriched BERT model to identify the profile consistency, and it gained improvements over strong baselines. Further evaluations on downstream tasks demonstrate that the profile consistency identification model is conducive for improving dialogue consistency.

“…Evaluating and interpreting open-domain dialog models is notoriously challenging. Multiple studies have shown that standard evaluation metrics such as perplexity and BLEU scores (Papineni et al, 2002) correlate very weakly with human judgements of conversation quality (Dziri et al, 2019). This has inspired multiple new approaches for evaluating dialog systems.…”

Section: Related Work (mentioning)

confidence: 99%

“…This has inspired multiple new approaches for evaluating dialog systems. One popular evaluation metric involves calculating the semantic similarity between the user input and generated response in high-dimensional embedding space (Dziri et al, 2019; Park et al, 2018; Zhao et al, 2017). […] proposed calculating conversation metrics such as sentiment and coherence on self-play conversations generated by trained models.…”

Section: Related Work (mentioning)

confidence: 99%

Proceedings of the 2nd Workshop on Natural Language Processing for Conversational AI

2020

One of the core components of voice assistants is the Natural Language Understanding (NLU) model. Its ability to accurately classify the user's request (or "intent") and recognize named entities in an utterance is pivotal to the success of these assistants. NLU models can be challenged in some languages by code-switching or morphological and orthographic variations. This work explores the possibility of improving the accuracy of NLU models for Indic languages via the use of alternate representations of input text for NLU, specifically ISO-15919 and IndicSOUNDEX, a custom SOUNDEX designed to work for Indic languages. We used a deep neural network based model to incorporate the information from alternate representations into the NLU model. We show that using alternate representations significantly improves the overall performance of NLU models when the amount of training data is limited.

“…For this initial study, we focus on two metrics, readability and coherence. These metrics are among those essential to evaluate the quality of generated responses (Novikova et al, 2017;Dziri et al, 2019). We describe an automated method to compute each metric.…”

Section: Metrics (mentioning)

confidence: 99%

Towards Best Experiment Design for Evaluating Dialogue System Output

Santhanam, Shaikh 2019

Proceedings of the 12th International Conference on Natural Language Generation

To overcome the limitations of automated metrics (e.g. BLEU, METEOR) for evaluating dialogue systems, researchers typically use human judgments to provide convergent evidence. While it has been demonstrated that human judgments can suffer from the inconsistency of ratings, extant research has also found that the design of the evaluation task affects the consistency and quality of human judgments. We conduct a between-subjects study to understand the impact of four experiment conditions on human ratings of dialogue system output. In addition to discrete and continuous scale ratings, we also experiment with a novel application of Best-Worst scaling to dialogue evaluation. Through our systematic study with 40 crowdsourced workers in each task, we find that using continuous scales achieves more consistent ratings than Likert scale or ranking-based experiment design. Additionally, we find that factors such as time taken to complete the task and no prior experience of participating in similar studies of rating dialogue system output positively impact consistency and agreement amongst raters.
