Proceedings of the 12th International Conference on Natural Language Generation 2019
DOI: 10.18653/v1/w19-8643

Best practices for the human evaluation of automatically generated text

Abstract: Currently, there is little agreement as to how Natural Language Generation (NLG) systems should be evaluated, with a particularly high degree of variation in the way that human evaluation is carried out. This paper provides an overview of how human evaluation is currently conducted, and presents a set of best practices, grounded in the literature. With this paper, we hope to contribute to the quality and consistency of human evaluations in NLG.

Cited by 125 publications (102 citation statements) · References 82 publications (95 reference statements)
“…Because of the minimal difference of 0.04, we decided to still use the PU measure in our further analysis for interpretation. With these results, we achieved a better agreement level than the average expert agreement for summarization evaluation reported in other papers (van der Lee et al., 2019).…”
Section: Comparing Crowd With Experts (supporting)
confidence: 51%
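The agreement level mentioned in this excerpt is an inter-annotator agreement statistic computed over crowd and expert judgements of the same items. Below is a minimal, hypothetical sketch of such a comparison, assuming quadratic-weighted Cohen's kappa and invented 5-point ratings (neither taken from the cited study nor related to its PU measure):

from sklearn.metrics import cohen_kappa_score

# Hypothetical 5-point ratings for the same eight items; not data from the cited study.
crowd_ratings  = [4, 3, 5, 2, 4, 4, 1, 3]
expert_ratings = [4, 3, 4, 2, 5, 4, 1, 3]

# Quadratic weighting treats ordinal (Likert-style) labels as graded disagreements.
kappa = cohen_kappa_score(crowd_ratings, expert_ratings, weights="quadratic")
print(f"Quadratic-weighted Cohen's kappa (crowd vs. expert): {kappa:.2f}")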
“…Therefore, more and more researchers refrain from using automatic metrics as a primary evaluation method (Reiter, 2018). Still, van der Lee et al. (2019) report that 80% of the empirical papers presented at the ACL NLG track or at the INLG conference in 2018 used automatic metrics, owing to the lack of alternatives and to their speed and cost-effectiveness.…”
Section: Untrained Automatic Metrics (mentioning)
confidence: 99%
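The untrained automatic metrics referred to here are word-overlap measures such as BLEU or ROUGE. As a minimal sketch of how one such metric is typically computed (the sentences are invented, and NLTK's sentence-level BLEU is used purely for illustration):

from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Invented reference and system output, tokenised into word lists.
reference  = ["there", "is", "no", "change", "in", "the", "patient", "condition"]
hypothesis = ["no", "change", "in", "the", "patient", "condition"]

# Smoothing prevents a zero score when some higher-order n-grams have no overlap.
smooth = SmoothingFunction().method1
score = sentence_bleu([reference], hypothesis, smoothing_function=smooth)
print(f"Sentence-level BLEU: {score:.3f}")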
“…The choice of a 5-point scale reflects the fact that this type of rating system is the most widely used in human evaluation tasks and is recommended in numerous studies. In essence, this scale gives more reliable results than scales with finer granularity because it appears easier for subjects to understand and handle (Korshunov et al., 2015; Sinkowitz et al., 2013; van der Lee et al., 2019).…”
Section: Methods (mentioning)
confidence: 99%
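To make the excerpt above concrete, here is a minimal, hypothetical sketch of a 5-point Likert item and a simple summary of the collected ratings (the question wording, rating values, and variable names are assumptions for illustration, not materials from the cited work):

import statistics

# Illustrative question wording and per-participant ratings.
QUESTION = "How fluent is this text? (1 = very disfluent, 5 = very fluent)"
ratings = [4, 5, 3, 4, 4, 2, 5, 4]

assert all(1 <= r <= 5 for r in ratings), "ratings must stay on the 5-point scale"
print(QUESTION)
print(f"n = {len(ratings)}, median = {statistics.median(ratings)}, "
      f"mean = {statistics.mean(ratings):.2f}, stdev = {statistics.stdev(ratings):.2f}")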
“…Human-Bot Conversations. In order to perform interactive multi-turn evaluations, the standard method is to let humans converse with a chatbot and rate it afterward (Ghandeharioun et al, 2019), typically using Likert scales (van der Lee et al, 2019). The ConvAI2 challenge (Dinan et al, 2020b) and the Alexa Prize (Venkatesh et al, 2018) applied this procedure.…”
Section: Related Work (mentioning)
confidence: 99%
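As a hypothetical illustration of how such post-conversation Likert ratings might then be compared across two systems (the systems, the ratings, and the choice of a Mann-Whitney U test are assumptions for this sketch, not the procedure of any cited paper):

from scipy.stats import mannwhitneyu

# Invented post-conversation Likert ratings for two hypothetical chatbots.
ratings_bot_a = [4, 4, 5, 3, 4, 5, 4, 3, 4, 5]
ratings_bot_b = [3, 3, 4, 2, 3, 4, 3, 3, 2, 4]

# A rank-based test is a common choice for ordinal Likert data.
statistic, p_value = mannwhitneyu(ratings_bot_a, ratings_bot_b, alternative="two-sided")
print(f"Mann-Whitney U = {statistic:.1f}, p = {p_value:.4f}")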