This paper proposes an approach to NLG system design that focuses on generating output text that readers can process more easily. We discuss ways in which cognitive theory might be combined with existing NLG techniques, and present two simple experiments in content ordering.
Human evaluation is widely regarded as the litmus test of quality in NLP. A basic requirement of all evaluations, in particular those used for meta-evaluation, is that they should support the same conclusions if repeated. However, the reproducibility of human evaluations is virtually never queried in NLP, let alone formally tested, and the repeatability of such evaluations and the reproducibility of their results remain open questions. This paper reports our review of human evaluation experiments published in NLP papers over the past five years, which we assessed in terms of (i) whether they can be rerun, and (ii) whether their results are reproduced when they are rerun. Overall, we estimate that just 5% of human evaluations are repeatable in the sense that (i) there are no prohibitive barriers to repetition, and (ii) sufficient information about the experimental design is publicly available for rerunning them. Our estimate rises to about 20% when author help is sought. We complement this investigation with a survey of results concerning the reproducibility of those human evaluations that are repeatable in the first place. Here we find worryingly low degrees of reproducibility, both in terms of the similarity of scores and of the findings they support. We summarise the insights gleaned so far regarding how to make human evaluations in NLP more repeatable and more reproducible.