Towards a Metric for Automated Conversational Dialogue System Evaluation and Improvement

Deriu, Jan; Cieliebak, Mark

doi:10.18653/v1/w19-8654

Cited by 4 publications

(5 citation statements)

References 19 publications

(22 reference statements)

“…Fluent speaker might speak with the odd mistake and that this is accepted
BBC Future | Online Materials | How smoothly and efficiently a second language speaker can speak on a range of topics in real time; While fluency may denote a degree of proficiency, it does not automatically imply accuracy - the ability to produce grammatically correct sentences - nor does it imply grammatical range
Bao et al [23] | Chatbot Paper | Whether the generated sentence is smooth and grammatically correct | Likert rating
Pang et al [28] | Chatbot Paper | (Language Fluency) the quality of phrasing relative to a human native speaker | Likert rating
Sinha et al [34] | Chatbot Paper | Question: How naturally did this user speak English? | Likert rating
Shum et al [35] | Chatbot Paper | Whether responses are grammatically correct and sound natural | Likert rating
Ji et al [40] | Chatbot Paper | Generate utterance is readablity and grammatical correctness | Likert rating
Deriu and Cieliebak [92] | Chatbot Paper | Question: Which entities' language is more fluent and grammatically correct? | Pairwise comparison
Feng et al [45] | Chatbot Paper | Question: how likely the generated response is from human?…”

Section: Understandable (mentioning)

confidence: 99%

Towards Standard Criteria for human evaluation of Chatbots: A Survey

Liang,

2021

Preprint

Human evaluation is becoming a necessity for testing the performance of chatbots. However, off-the-shelf settings suffer from severe reliability and replication issues, partly because of the extremely high diversity of criteria. It is high time to come up with standard criteria and exact definitions. To this end, we conduct a thorough investigation of 105 papers involving human evaluation of chatbots. Deriving from this, we propose five standard criteria along with precise definitions.

“…However, it still relies on a reference during inference. AutoJudge (Deriu and Cieliebak, 2019) removed the reliance on references, which allows the evaluation of multi-turn behavior of the dialogue system. However, AutoJudge still leverages annotated data for training.…”

Section: Related Work (mentioning)

confidence: 99%

“…Unfortunately, methods such as BLEU (Papineni et al, 2002) have been shown to not be applicable to conversational dialogue systems (Liu et al, 2016). Following this observation, in recent years, the trend towards training methods for evaluating dialogue systems emerged (Lowe et al, 2017;Deriu and Cieliebak, 2019;Mehri and Eskenazi, 2020;Deriu et al, 2020). The models are trained to take as input a pair of context and candidate response, and output a numerical score that rates the candidate for the given context.…”

Section: Introduction (mentioning)

confidence: 99%

Probing the Robustness of Trained Metrics for Conversational Dialogue Systems

Deriu, Tuggener, Däniken et al. 2022

Preprint

Self Cite

This paper introduces an adversarial method to stress-test trained metrics to evaluate conversational dialogue systems. The method leverages Reinforcement Learning to find response strategies that elicit optimal scores from the trained metrics. We apply our method to test recently proposed trained metrics. We find that they all are susceptible to giving high scores to responses generated by relatively simple and obviously flawed strategies that our method converges on. For instance, simply copying parts of the conversation context to form a response yields competitive scores or even outperforms responses written by humans.

“…Interactive evaluation systems attract increasing attention lately. Ghandeharioun et al (2019) and Deriu and Cieliebak (2019) use dialogues between a bot and itself, which is called self-talk, to evaluate the bot in a more automatic manner. But it often leads to a lot of repeated chat context.…”

Section: Related Work (mentioning)

confidence: 99%

ChatMatch: Evaluating Chatbots by Autonomous Chat Tournaments

Yang, Li, Tang et al. 2022

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Existing automatic evaluation systems of chatbots mostly rely on static chat scripts as ground truth, which is hard to obtain, and requires access to the models of the bots as a form of "white-box testing". Interactive evaluation mitigates this problem but requires human involvement. In our work, we propose an interactive chatbot evaluation framework in which chatbots compete with each other like in a sports tournament, using flexible scoring metrics. This framework can efficiently rank chatbots independently from their model architectures and the domains for which they are trained.
