Proceedings of the 13th International Conference on Natural Language Generation 2020
DOI: 10.18653/v1/2020.inlg-1.24

Disentangling the Properties of Human Evaluation Methods: A Classification System to Support Comparability, Meta-Evaluation and Reproducibility Testing

Anya Belz,
Simon Mille,
David M. Howcroft

Abstract: Current standards for designing and reporting human evaluations in NLP mean it is generally unclear which evaluations are comparable and can be expected to yield similar results when applied to the same system outputs. This has serious implications for reproducibility testing and meta-evaluation, in particular given that human evaluation is considered the gold standard against which the trustworthiness of automatic metrics is gauged. Using examples from NLG, we propose a classification system for evaluations b…

Cited by 6 publications (2 citation statements); references 35 publications.
“…When evaluating Comprehensibility of a translated text, the source language text is not shown to evaluators. In terms of the classification system proposed by Belz et al. (2020), Comprehensibility captures the goodness of both the form and content of a text in its own right, and is assessed here by a subjective, absolute, intrinsic evaluation measure.…”
Section: Comprehensibility
Confidence: 99%
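
The properties named in the statement above (intrinsic vs. extrinsic, subjective vs. objective, absolute vs. relative, form vs. content) can be illustrated with a minimal sketch. This is not the schema defined by Belz et al. (2020); the class and field names below are hypothetical shorthand for the properties mentioned in the quoted citation statement.

```python
from dataclasses import dataclass

# Hypothetical shorthand for the evaluation-measure properties named in the
# citation statement above; not the official classification schema of
# Belz et al. (2020).
@dataclass(frozen=True)
class EvaluationMeasure:
    name: str
    form_or_content: str     # "form", "content", or "both"
    frame_of_reference: str  # e.g. "in its own right" (intrinsic) vs. extrinsic
    judgment_type: str       # "subjective" or "objective"
    scale_type: str          # "absolute" or "relative"

# Comprehensibility as characterised in the quoted statement:
comprehensibility = EvaluationMeasure(
    name="Comprehensibility",
    form_or_content="both",
    frame_of_reference="in its own right",  # intrinsic evaluation
    judgment_type="subjective",
    scale_type="absolute",
)

if __name__ == "__main__":
    print(comprehensibility)
```
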
“…There has been much discussion lately in the field of Natural Language Generation (NLG) focusing on the need for evaluation benchmarks and standards, as evidenced by the prolific literature on the issues surrounding human evaluation (Howcroft et al., 2020; Clark et al., 2021; Hämäläinen and Alnajjar, 2021), as well as recently proposed benchmarks (Khashabi et al., 2021; Mille et al., 2021). These are important and necessary debates; however, work has focused mainly on two-party dialogue systems.…”
Section: Introduction
Confidence: 99%