Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017
DOI: 10.18653/v1/P17-1103

Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses

Abstract: Automatically evaluating the quality of dialogue responses for unstructured domains is a challenging problem. Unfortunately, existing automatic evaluation metrics are biased and correlate very poorly with human judgements of response quality. Yet having an accurate automatic evaluation procedure is crucial for dialogue research, as it allows rapid prototyping and testing of new models with fewer expensive human evaluations. In response to this challenge, we formulate automatic dialogue evaluation as a learning…

Cited by 284 publications (300 citation statements); references 38 publications.
“…For this, we sample 100 different contexts randomly from a set of unseen contexts and let the dialogue system generate a dialogue starting from this context, which consist of 10 turns each. For the annotation process, we use Amazon Mechanical Turk (AMT) 1 and follow the procedure outlined by (Lowe et al, 2017), i.e. the judges rated the overall quality of each turn on a scale from 1 (low quality) to 5 (high quality).…”
Section: Turn-level Annotation
confidence: 99%
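A minimal sketch of how such 1-5 turn-level AMT ratings could be aggregated into per-turn and overall quality scores (the data layout and numbers are invented for illustration, not taken from the cited paper):

    # Hypothetical layout: ratings[dialogue_id][turn_index] is the list of judge
    # scores for that turn, each on the 1 (low quality) to 5 (high quality) scale.
    from statistics import mean

    ratings = {
        "dialogue_0": [[4, 5, 4], [2, 3, 2]],   # invented toy data: 2 turns, 3 judges
        "dialogue_1": [[1, 2, 1], [5, 4, 5]],
    }

    # Per-turn quality is the mean judge score; a system-level score can then
    # average over all turns of all sampled dialogues.
    turn_quality = {d: [mean(turn) for turn in turns] for d, turns in ratings.items()}
    overall = mean(score for turns in turn_quality.values() for score in turns)
    print(turn_quality, overall)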
“…Trained Metrics. Recently, the notion of trained metrics was introduced for conversational dialogue systems (Lowe et al, 2017). The main idea is that humans rate the generated response of a dialogue system in relation to a given context (i.e.…”
Section: Introduction
confidence: 99%
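For illustration, the trained metric introduced by Lowe et al. (2017), ADEM, predicts a score from learned projections of the context, reference response, and model response encodings. The sketch below assumes pre-computed sentence vectors and uses random matrices as stand-ins for the learned parameters; names and constants are placeholders, not the paper's released code:

    # ADEM-style trained metric: the scoring function is roughly
    # score(c, r, r_hat) = (c^T M r_hat + r^T N r_hat - alpha) / beta,
    # with c, r, r_hat encodings of context, reference response, and model
    # response, and M, N learned matrices.
    import numpy as np

    dim = 128
    rng = np.random.default_rng(0)
    M = rng.normal(scale=0.01, size=(dim, dim))   # learned in practice, random here
    N = rng.normal(scale=0.01, size=(dim, dim))
    alpha, beta = 0.0, 1.0                        # scaling constants fit to the 1-5 range

    def adem_score(c, r, r_hat):
        """Predict a human-like quality score for model response r_hat."""
        return (c @ M @ r_hat + r @ N @ r_hat - alpha) / beta

    c, r, r_hat = rng.normal(size=(3, dim))       # stand-ins for RNN sentence encodings
    print(adem_score(c, r, r_hat))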
“…QE has been an active topic in many NLP tasks - image captioning (Anderson et al., 2016), dialogue response generation (Lowe et al., 2017), grammar correction (Napoles et al., 2016) or text simplification (Martin et al., 2018) - with MT being perhaps the most prominent area (Specia et al., 2010; Avramidis, 2012; Specia et al., 2018). QE for NLG recently saw an increase of focus in various subtasks, such as title generation (Ueffing et al., 2018; Camargo de Souza et al., 2018) or content selection and ordering (Wiseman et al., 2017).…”
Section: Related Work
confidence: 99%
“…Each query has 50 candidate tables on average. It is still an open problem to automatically evaluate the performance of a natural language generation system (Lowe et al, 2017). In this work, we use BLEU-4 (Papineni et al, 2002) score as the evaluation metric, which measures the overlap between the generated question and the referenced question.…”
Section: Table-Based QA and QG
confidence: 99%
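As an illustration of this evaluation setup (not code from the cited work), BLEU-4 between a generated and a reference question can be computed with NLTK's sentence_bleu; the example sentences, tokenization, and smoothing choice below are assumptions:

    # BLEU-4 (Papineni et al., 2002) for one generated question against one reference.
    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = "what is the capital city of france".split()
    candidate = "what is france 's capital city".split()

    # Equal weights over 1- to 4-gram precision; smoothing avoids a zero score
    # when some higher-order n-gram has no overlap with the reference.
    bleu4 = sentence_bleu([reference], candidate,
                          weights=(0.25, 0.25, 0.25, 0.25),
                          smoothing_function=SmoothingFunction().method1)
    print(f"BLEU-4: {bleu4:.3f}")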