2022
DOI: 10.48550/arxiv.2202.06935
Preprint

Repairing the Cracked Foundation: A Survey of Obstacles in Evaluation Practices for Generated Text

Abstract: Evaluation practices in natural language generation (NLG) have many known flaws, but improved evaluation approaches are rarely widely adopted. This issue has become more urgent, since neural NLG models have improved to the point where they can often no longer be distinguished based on the surface-level features that older metrics rely on. This paper surveys the issues with human and automatic model evaluations and with commonly used datasets in NLG that have been pointed out over the past 20 years. We summarize, …

Cited by 16 publications (22 citation statements)
References 120 publications (125 reference statements)
Citation types: 0 supporting, 22 mentioning, 0 contrasting

“…For each dataset, we report the best previous sota result. For generation tasks, we generally report ROUGE-2 following the advice of (Gehrmann et al, 2022). For the rest of the datasets, we report the dominant metric that is reported in prior work.…”
Section: Datasets for Supervised Finetuning
confidence: 99%
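
The statement above reports ROUGE-2 for generation tasks. As a minimal sketch, assuming Google's rouge-score package (the quoted paper does not say which implementation it used), computing ROUGE-2 for a single prediction looks like this:

```python
# Minimal ROUGE-2 sketch using the `rouge-score` package
# (pip install rouge-score). The package choice and example
# strings are assumptions for illustration only.
from rouge_score import rouge_scorer

# ROUGE-2 measures bigram overlap between a prediction and a reference.
scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)

reference = "the cat sat on the mat"
prediction = "the cat lay on the mat"

scores = scorer.score(reference, prediction)  # reference (target) comes first
r2 = scores["rouge2"]
print(f"ROUGE-2 P={r2.precision:.3f} R={r2.recall:.3f} F1={r2.fmeasure:.3f}")
```

Corpus-level ROUGE is typically an average of such per-example F1 scores, though aggregation details vary across implementations.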
“…Gundersen & Kjensmo (2018) highlight problems with reproducibility and replicability in AI research. Gehrmann et al (2022) comprehensively discuss issues with Natural Language Generation research, including statistical significance, something that Marie et al (2021) investigate for neural machine translation, specifically. Berg-Kirkpatrick et al (2012) and Card et al (2020) investigate the limitations of p-values and statistical power in NLP.…”
Section: Related Work
confidence: 99%
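
Since several of the works cited above concern significance testing, here is a minimal sketch of paired bootstrap resampling in the style analyzed by Berg-Kirkpatrick et al (2012); the function name, numbers, and exact null formulation are illustrative assumptions, not code from any cited paper:

```python
# Minimal paired-bootstrap sketch: estimate how surprising system A's
# observed advantage over system B would be under resampling of the
# test set. All names and numbers here are illustrative.
import random

def paired_bootstrap(scores_a, scores_b, n_resamples=10_000, seed=0):
    """Approximate p-value for 'A is not actually better than B'."""
    assert len(scores_a) == len(scores_b)
    rng = random.Random(seed)
    n = len(scores_a)
    observed = (sum(scores_a) - sum(scores_b)) / n
    exceed = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        delta = sum(scores_a[i] - scores_b[i] for i in idx) / n
        # Shifted null of Berg-Kirkpatrick et al (2012): count resamples
        # whose delta is more than twice the observed delta.
        if delta > 2 * observed:
            exceed += 1
    return exceed / n_resamples

# Per-example metric scores for two systems (made-up numbers).
a = [0.31, 0.28, 0.40, 0.22, 0.35]
b = [0.29, 0.27, 0.33, 0.25, 0.30]
print(f"p ≈ {paired_bootstrap(a, b):.3f}")
```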
“…It is important to either reference the exact evaluation script used (including parameters, citation and version, if applicable) or at least include the evaluation script in the code base. Moreover, to ease error or post-hoc analyses, we highly recommend saving model predictions in separate files whenever possible, and making them available at publication (Card et al, 2020; Gehrmann et al, 2022). This could for instance be done using plain .txt or .csv files.…”
Section: Models
confidence: 99%
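
As a minimal sketch of the practice recommended above, predictions can be written to a standalone .csv alongside a run's other outputs; the path and column names below are illustrative assumptions:

```python
# Minimal sketch: persist model predictions in a separate .csv so that
# error and post-hoc analyses do not require re-running the model.
# The path and column names are illustrative assumptions.
import csv

def save_predictions(path, ids, predictions, references=None):
    """Write one example per row: id, model prediction, optional reference."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["id", "prediction", "reference"])
        refs = references if references is not None else [""] * len(predictions)
        for ex_id, pred, ref in zip(ids, predictions, refs):
            writer.writerow([ex_id, pred, ref])

save_predictions(
    "run1_test_predictions.csv",
    ids=[0, 1],
    predictions=["the cat sat", "a dog ran"],
    references=["the cat sat on the mat", "the dog ran away"],
)
```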
“…At the same time, interest in Deep Learning (DL) has increased substantially as well, demonstrated via Google Trends in the same figure. While such progress is remarkable, rapid growth comes at a cost: Akin to concerns in other disciplines (John et al, 2012; Jensen et al, 2021), several authors have noted issues with reproducibility (Gundersen & Kjensmo, 2018; Belz et al, 2021) and a lack of significance testing (Marie et al, 2021), or published results not carrying over to different experimental setups, for instance in NLP (Narang et al, 2021; Gehrmann et al, 2022), Reinforcement Learning (Henderson et al, 2018; Agarwal et al, 2021), and optimization (Schmidt et al, 2021a). Others have questioned commonly-accepted procedures (Gorman & Bedrick, 2019; Søgaard et al, 2021; Bouthillier et al, 2021; van der Goot, 2021) as well as the (negative) impacts of research on society (Hovy & Spruit, 2016; Mohamed et al, 2020; Bender et al, 2021; Birhane et al, 2021) and environment (Strubell et al, 2019; Schwartz et al, 2020; …).…”
confidence: 99%