2016
DOI: 10.1007/978-3-319-21834-2_19

A Semi-automated Evaluation Metric for Dialogue Model Coherence

Cited by 15 publications (19 citation statements)
References 13 publications
“…DeVault et al. (2011) train an automatic dialogue policy evaluation metric from 19 structured role-playing sessions, enriched with paraphrases and external referee annotations. Gandhe and Traum (2016) propose a semi-automatic evaluation metric for dialogue coherence, similar to BLEU and ROUGE, based on 'wizard of Oz' type data. Xiang et al. (2014) propose a framework to predict utterance-level problematic situations in a dataset of Chinese dialogues using intent and sentiment factors.…”
Section: Related Work
confidence: 99%
“…DeVault et al 2011and Gandhe and Traum (2016) tackle the problem of having multiple relevant candidate utterances and propose a metric which takes this into account. Their metrics are both dependent on human judges and measure the appropriateness of an utterance.…”
Section: Utterance Selection Metricsmentioning
confidence: 99%
“…Voted appropriateness: One major drawback of weak agreement is that it depends on human annotations and is not applicable to large amounts of data. Gandhe and Traum (2016) improve upon the idea of weak agreement by introducing the Voted Appropriateness metric. Voted Appropriateness takes into account the number of judges who selected an utterance for a given context.…”
Section: Utterance Selection Metrics
confidence: 99%
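To make the vote-based idea concrete, here is a minimal sketch of how an appropriateness score weighted by judge votes could be computed. The exact scoring and normalization used by Gandhe and Traum (2016) may differ; the function name voted_appropriateness, the data layout, and the vote-share averaging below are illustrative assumptions, not the paper's reference implementation.

```python
# Illustrative sketch (assumption): score each system response by the fraction
# of human judges who selected that same utterance for the given context,
# then average the per-context scores over all evaluated contexts.
from collections import Counter
from typing import Dict, List


def voted_appropriateness(
    system_choices: Dict[str, str],       # context id -> utterance chosen by the system
    judge_choices: Dict[str, List[str]],  # context id -> utterances chosen by each judge
) -> float:
    """Average, over contexts, of the vote share received by the system's utterance."""
    scores = []
    for context_id, system_utt in system_choices.items():
        votes = Counter(judge_choices.get(context_id, []))
        total_votes = sum(votes.values())
        if total_votes == 0:
            continue  # no judge data for this context, skip it
        scores.append(votes[system_utt] / total_votes)
    return sum(scores) / len(scores) if scores else 0.0


# Example: two of three judges picked the same utterance as the system.
print(voted_appropriateness(
    {"ctx1": "u_a"},
    {"ctx1": ["u_a", "u_a", "u_b"]},
))  # -> 0.666...
```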
“…The robustness of the evaluation of chatbots is often hampered by inter-annotator agreement (IAA) (Gandhe and Traum, 2016). Measuring and reporting IAA is not yet a standard practice in evaluating chatbots (Amidei et al, 2019a), and producing annotations with high IAA on open-domain conversations is prone to be impeded by subjective interpretation of feature definitions and idiosyncratic annotator behavior (Bishop and Herron, 2015).…”
Section: On Inter-annotator Agreement
confidence: 99%
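For context, inter-annotator agreement on such appropriateness labels is usually reported with a chance-corrected statistic. The snippet below uses Cohen's kappa from scikit-learn as one common choice; it is purely an illustration of how IAA between two annotators would be measured, not the procedure used in the cited works.

```python
# Illustrative sketch: chance-corrected agreement between two annotators who
# labeled the same set of responses as appropriate (1) or not appropriate (0).
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 1, 0, 1, 0, 0, 1, 1]
annotator_b = [1, 0, 0, 1, 0, 1, 1, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # values near 0 indicate chance-level agreement
```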