2021
DOI: 10.1162/tacl_a_00417

A Statistical Analysis of Summarization Evaluation Metrics Using Resampling Methods

Abstract: The quality of a summarization evaluation metric is quantified by calculating the correlation between its scores and human annotations across a large number of summaries. Currently, it is unclear how precise these correlation estimates are, nor whether differences between two metrics’ correlations reflect a true difference or if it is due to mere chance. In this work, we address these two problems by proposing methods for calculating confidence intervals and running hypothesis tests for correlations using two …
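To make the idea of a confidence interval for a correlation concrete, here is a minimal sketch of a generic percentile bootstrap over summaries. It is an illustration of resampling in general, not the procedure from the paper (the abstract above is truncated before it names its methods). The function names `pearson` and `bootstrap_ci`, the defaults, and the toy scores are all assumptions made for this example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if either is constant)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


def bootstrap_ci(metric_scores, human_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the metric/human correlation.

    Hypothetical helper: resamples (metric, human) pairs with replacement
    and takes empirical quantiles of the resampled correlations.
    """
    rng = random.Random(seed)
    n = len(metric_scores)
    stats = []
    for _ in range(n_resamples):
        # Draw n paired observations with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        r = pearson([metric_scores[i] for i in idx],
                    [human_scores[i] for i in idx])
        if not math.isnan(r):  # skip degenerate resamples with no variance
            stats.append(r)
    if not stats:
        raise ValueError("every resample was degenerate")
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper


if __name__ == "__main__":
    # Toy scores, invented for illustration only.
    metric = [0.1, 0.4, 0.35, 0.8, 0.6, 0.9, 0.2, 0.7]
    human = [1, 2, 2, 4, 3, 5, 1, 4]
    print(pearson(metric, human), bootstrap_ci(metric, human))
```

With only a handful of summaries the interval will typically be wide, which is the kind of imprecision the abstract is concerned with.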

Cited by 27 publications (42 citation statements)
References 27 publications

“…Automatic evaluation metrics are the most common method that researchers use to quickly and cheaply approximate how humans would rate the quality of a summarization system (Lin, 2004;Louis and Nenkova, 2013;Zhao et al, 2019;Zhang et al, 2020;Deutsch et al, 2021a, among others). The quality of a metric -how similarly it replicates human judgments of systems -is quantified by calculating the correlation between the metric's scores and human judgments on a set of systems, known as the system-level correlation (Louis and Nenkova, 2013;Deutsch et al, 2021b).…”
Section: Introduction (mentioning)
confidence: 99%
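The system-level correlation described in this statement can be sketched as follows. This is a minimal illustration that assumes each system's score is the mean of its per-summary scores and that Pearson's r is the correlation coefficient. Both choices, the function names, and the toy data are assumptions, not details taken from the quoted text.

```python
import math
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def system_level_correlation(metric_scores, human_scores):
    """Correlate per-system averages of metric scores and human judgments.

    Both arguments map a system name to its list of per-summary scores
    (a hypothetical data layout chosen for this example).
    """
    systems = sorted(metric_scores)
    if sorted(human_scores) != systems:
        raise ValueError("both inputs must cover the same systems")
    # One aggregate score per system, then one correlation across systems.
    metric_means = [mean(metric_scores[s]) for s in systems]
    human_means = [mean(human_scores[s]) for s in systems]
    return pearson(metric_means, human_means)


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    metric = {"A": [0.1, 0.3], "B": [0.4, 0.6], "C": [0.7, 0.9]}
    human = {"A": [1, 2], "B": [2, 3], "C": [3, 4]}
    print(system_level_correlation(metric, human))
```

Because the correlation is computed over systems rather than over individual summaries, the number of data points equals the number of systems, which is often small.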
“…Comparing correlations of these metrics across shared tasks from the Text Analysis Conferences (TAC) and CNN/DM and using a different annotation scheme, Bhandari et al (2020b) corroborate the very low segment-level correlations and also find that no distributional metric outperforms ROUGE. Reanalyzing the data and addressing issues in the statistical tests, Deutsch et al (2021b) come to the same conclusion about ROUGE, but note the insights should be carefully assessed since the data selection strategy for annotations, coupled with large confidence intervals, can lead to false results. Beyond summarization, Novikova et al (2017a) note similarly poor segment-level correlations for data-to-text datasets.…”
Section: How To Interpret Similarity-based Metrics? (mentioning)
confidence: 97%
“…The number of required annotations can potentially be decreased by not uniformly sampling examples to annotate and instead biasing the sampling toward those where models differ. However, this process can lead to artificially high correlation of the results with automatic metrics, which could overstate their effectiveness and the quality of human annotations (Deutsch et al, 2021b). Moreover, since NLG models may only differ in very few examples, statistical analyses should also handle ties as discussed by Dras (2015) for pairwise rankings.…”
Section: Statistical Significance (mentioning)
confidence: 99%
“…They are just estimates, and there is no standard method for determining confidence in them. Consider the following two statistical "facts": 1) total revenue generated by arcades correlates with the number of computer science doctorates awarded in the United States (98.51%, r = 0.985065), and 2) spending on science, space, and technology correlates with suicides by hanging, strangulation, and suffocation at the level of 99.79% (r = 0.99789126). Even if we concede that these correlations are objective and unbiased, we are certainly not committed to accept them as important, relevant, or useful.…”
Section: Metric Mania Revisited (mentioning)
confidence: 99%