2020
DOI: 10.48550/arxiv.2005.10716
Preprint

Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation

Abstract: Open-domain dialog system evaluation is one of the most important challenges in dialog research. Existing automatic evaluation metrics, such as BLEU, are mostly reference-based: they calculate the difference between the generated response and a limited number of available references. Likert-score-based self-reported user ratings are widely adopted by social conversational systems, such as Amazon Alexa Prize chatbots. However, self-reported user ratings suffer from bias and variance among different users. To allevia…
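The reference-based limitation the abstract describes is easy to see concretely. Below is a minimal sketch using NLTK's sentence_bleu (a toolkit choice assumed here for illustration, not prescribed by the paper): a perfectly sensible reply that shares no n-grams with the available references scores near zero, while a paraphrase of a reference scores high.

```python
# Sketch of reference-based scoring with BLEU (assumes NLTK is installed).
# BLEU measures n-gram overlap between a generated response and a small
# set of references, so valid-but-different responses are penalized --
# the limitation the abstract points out.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

references = [
    "i like jazz and classical music".split(),
    "mostly jazz , some classical".split(),
]
on_reference = "i like jazz and classical".split()   # close to a reference
off_reference = "my favorite genre is rock".split()  # valid reply, no overlap

smooth = SmoothingFunction().method1  # avoid zero scores on short texts
for candidate in (on_reference, off_reference):
    score = sentence_bleu(references, candidate, smoothing_function=smooth)
    print(" ".join(candidate), "->", round(score, 4))
# The off-reference reply scores near zero despite being a reasonable answer.
```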

Cited by 5 publications (5 citation statements)
References 43 publications
“…Therefore, this plot may more reflect the UX of our response generators than a user preference for longer responses. These results may also reflect the inherent noise in user Likert-scale ratings (Liang et al., 2020).…”
Section: Analysis 6.1 Relationship Between Rating and Engagement (mentioning)
confidence: 97%
“…Pairwise versus single-model ratings: Conversations are often either rated individually, e.g. with Likert-score ratings (Ashwin et al., 2017; Venkatesh et al., 2018; Zhang et al., 2018; Rashkin et al., 2019; See et al., 2019a; Dinan et al., 2019b), or pairwise by comparing models (Liang et al., 2020; Vinyals and Le, 2015; Li et al., 2016; Lee et al., 2020). Likert scoring relies on absolute identification rather than relative discrimination, which is less reliable in humans (Stewart et al., 2005), leading to different biases per annotator (Kulikov et al., 2019).…”
Section: Existing Work (mentioning)
confidence: 99%
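The per-annotator bias the quoted passage describes can be illustrated with a toy simulation. Everything below (model names, quality values, bias and noise magnitudes) is invented for illustration and comes from none of the cited studies: a fixed per-annotator offset shifts Likert scores, but cancels in a pairwise comparison because the same annotator judges both models.

```python
# Toy illustration: annotator offset bias affects absolute Likert ratings
# but cancels in pairwise comparisons. All quantities are hypothetical.
import random

random.seed(0)
TRUE_QUALITY = {"model_a": 3.6, "model_b": 3.2}  # assumed ground truth
annotator_bias = [random.uniform(-1.0, 1.0) for _ in range(20)]

# Likert setting: each annotator rates a single model, so their personal
# offset leaks directly into that model's scores.
likert = {m: [] for m in TRUE_QUALITY}
for i, bias in enumerate(annotator_bias):
    model = "model_a" if i % 2 == 0 else "model_b"
    likert[model].append(TRUE_QUALITY[model] + bias + random.gauss(0, 0.3))

# Pairwise setting: the same annotator scores both models, so the offset
# is added to both sides and cancels in the comparison.
wins_a = sum(
    (TRUE_QUALITY["model_a"] + b + random.gauss(0, 0.3))
    > (TRUE_QUALITY["model_b"] + b + random.gauss(0, 0.3))
    for b in annotator_bias
)

for model, scores in likert.items():
    print(model, "Likert mean:", round(sum(scores) / len(scores), 2))
print("pairwise win rate for model_a:", wins_a / len(annotator_bias))
```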
“…Other examples, like Elliot et al. [7] and Mueser et al. [25], found high correlations between rankings resulting from the evaluation of physical features in humans. Liang et al. [20] propose a model to 'calibrate' self-reported user ratings for dialogue systems due to issues with validity and bias. In relation to biomedical image assessments, where evaluation considers the visual quality of the stimuli, Phelps et al. [27] found that pairwise comparisons and ranked Likert scores made for more accurate assessments than non-ranked Likert scores.…”
Section: Related Work (mentioning)
confidence: 99%