Proceedings of the 12th International Conference on Natural Language Generation 2019
DOI: 10.18653/v1/w19-8654

Towards a Metric for Automated Conversational Dialogue System Evaluation and Improvement

Abstract: We present "AutoJudge", an automated evaluation method for conversational dialogue systems. The method works by first generating dialogues based on self-talk, i.e. a dialogue system talking to itself. Then, it uses human ratings of these dialogues to train an automated judgement model. Our experiments show that AutoJudge correlates well with the human ratings and can be used to automatically evaluate dialogue systems, even in deployed systems. In a second part, we attempt to apply AutoJudge to improve existing …
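The self-talk generation step described in the abstract can be illustrated with a minimal, hypothetical sketch: a single dialogue system repeatedly responds to its own previous utterance, producing a dialogue that can later be rated by humans. The `toy_bot` function below is a stand-in assumption, not the trained neural dialogue system used in the paper.

```python
def toy_bot(history):
    # Hypothetical stand-in for a real dialogue model: it simply
    # echoes a reference to the previous utterance.
    last = history[-1] if history else ""
    return f"Reply to: {last}" if last else "Hello!"

def self_talk(bot, turns=4):
    """Let one dialogue system talk to itself for `turns` utterances.

    Each generated utterance is appended to the shared history, so the
    bot's next turn is conditioned on its own earlier output.
    """
    history = []
    for _ in range(turns):
        history.append(bot(history))
    return history

dialogue = self_talk(toy_bot)
```

In AutoJudge, dialogues produced this way are then annotated by human judges, and those ratings supervise the automated judgement model.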

Cited by 4 publications (5 citation statements); references 19 publications (22 reference statements).
“…a fluent speaker might speak with the odd mistake, and this is accepted.

- BBC Future (Online Materials): How smoothly and efficiently a second-language speaker can speak on a range of topics in real time; while fluency may denote a degree of proficiency, it does not automatically imply accuracy (the ability to produce grammatically correct sentences), nor does it imply grammatical range.
- Bao et al. [23] (Chatbot Paper): Whether the generated sentence is smooth and grammatically correct; Likert rating.
- Pang et al. [28] (Chatbot Paper): (Language Fluency) the quality of phrasing relative to a native human speaker; Likert rating.
- Sinha et al. [34] (Chatbot Paper): Question: "How naturally did this user speak English?"; Likert rating.
- Shum et al. [35] (Chatbot Paper): Whether responses are grammatically correct and sound natural; Likert rating.
- Ji et al. [40] (Chatbot Paper): Whether the generated utterance is readable and grammatically correct; Likert rating.
- Deriu and Cieliebak [92] (Chatbot Paper): Question: "Which entity's language is more fluent and grammatically correct?"; pairwise comparison.
- Feng et al. [45] (Chatbot Paper): Question: "How likely is the generated response to be from a human?…”

Section: Understandable
confidence: 99%
“…However, it still relies on a reference during inference. AutoJudge (Deriu and Cieliebak, 2019) removed the reliance on references, which allows the evaluation of multi-turn behavior of the dialogue system. However, AutoJudge still leverages annotated data for training.…”
Section: Related Work
confidence: 99%
“…Unfortunately, methods such as BLEU (Papineni et al., 2002) have been shown not to be applicable to conversational dialogue systems (Liu et al., 2016). Following this observation, a trend towards trained methods for evaluating dialogue systems has emerged in recent years (Lowe et al., 2017; Deriu and Cieliebak, 2019; Mehri and Eskenazi, 2020; Deriu et al., 2020). These models are trained to take as input a pair of context and candidate response, and to output a numerical score that rates the candidate for the given context.…”
Section: Introduction
confidence: 99%
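The trained-metric setup described above, a model mapping a (context, candidate response) pair to a numerical score, can be sketched with a deliberately simple stand-in. The feature set and logistic-regression learner below are illustrative assumptions, not the architecture used by AutoJudge or the other cited metrics; real systems use learned neural representations rather than word-overlap features.

```python
import math

def features(context, response):
    # Toy features: bias, word overlap with the context, and a
    # capped response-length signal. Purely illustrative.
    c, r = set(context.lower().split()), set(response.lower().split())
    overlap = len(c & r) / max(len(r), 1)
    return [1.0, overlap, min(len(r) / 10.0, 1.0)]

def predict(w, x):
    # Sigmoid over a linear score, yielding a rating in (0, 1).
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def train(triples, epochs=200, lr=0.5):
    """Fit logistic regression on (context, response, human_rating) triples."""
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for context, response, y in triples:
            x = features(context, response)
            err = predict(w, x) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w

# Tiny invented training set standing in for human-rated dialogues.
data = [
    ("how are you today", "i am fine thanks how are you", 1.0),
    ("how are you today", "banana", 0.0),
    ("do you like music", "yes i like music a lot", 1.0),
    ("do you like music", "xyz", 0.0),
]
w = train(data)
good = predict(w, features("how are you today", "i am fine how are you"))
bad = predict(w, features("how are you today", "qqq"))
```

The key property such metrics aim for, and the one AutoJudge validates against human judgements, is that scores for appropriate responses come out higher than scores for inappropriate ones (here, `good > bad`), without consulting a reference response at inference time.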
“…Interactive evaluation systems have attracted increasing attention lately. Ghandeharioun et al. (2019) and Deriu and Cieliebak (2019) use dialogues between a bot and itself, called self-talk, to evaluate the bot in a more automatic manner. However, this often leads to highly repetitive chat contexts.…”
Section: Related Work
confidence: 99%