2022
DOI: 10.48550/arxiv.2204.10216
Preprint

Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics

Abstract: How reliably an automatic summarization evaluation metric replicates human judgments of summary quality is quantified by system-level correlations. We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice and propose changes to rectify this disconnect. First, we calculate the system score for an automatic metric using the full test set instead of the subset of summaries judged by humans, which is currently standard pra…
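The system-level correlation the abstract describes can be sketched in a few lines: each system's per-summary scores are reduced to a single system score (the mean), and the correlation is then computed across systems. This is a minimal illustration, not the paper's implementation; the system names and scores below are hypothetical, and the paper's first proposed change is that the metric's system scores should be averaged over the full test set rather than only the human-judged subset.

```python
from statistics import mean

def pearson(xs, ys):
    # Pearson correlation between two equal-length lists of scores.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def system_level_correlation(metric_scores, human_scores):
    # metric_scores / human_scores: dict mapping system name -> list of
    # per-summary scores. Each system is collapsed to its mean score,
    # then the correlation is taken across systems.
    systems = sorted(metric_scores)
    metric_sys = [mean(metric_scores[s]) for s in systems]
    human_sys = [mean(human_scores[s]) for s in systems]
    return pearson(metric_sys, human_sys)

# Hypothetical scores for three systems. Note the metric can be scored on
# more summaries (the full test set) than the human-judged subset.
metric = {"A": [0.30, 0.40, 0.35], "B": [0.50, 0.55], "C": [0.20, 0.25]}
human  = {"A": [3.0, 3.5],         "B": [4.0, 4.5],   "C": [2.0, 2.5]}
r = system_level_correlation(metric, human)
```

Rank-based coefficients such as Kendall's tau are also common for this purpose; Pearson is used here only to keep the sketch self-contained.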

Cited by 4 publications (4 citation statements)
References 5 publications
“…This is mostly due to the inflexibility of computing the F1 score solely based on the similarity between the generated responses and golden references. Such observations echo findings from previous studies (Madotto et al, 2019; Liu et al, 2016; Deutsch et al, 2022). In this work, we still utilize the F1 score since it is a standardized metric for PERSONA-CHAT evaluation.…”
Section: Limitations (supporting)
confidence: 83%
“…Deutsch et al (2021) investigated the preciseness of correlations between metrics and human annotations in meta-evaluation benchmarks, and proposed approaches to improve the level of confidence. Finally, Deutsch et al (2022) discussed ways of improving the reliability of system-level correlations in meta-evaluation.…”
Section: Background and Related Work (mentioning)
confidence: 99%
“…However, such system-level numbers are not very informative when one is interested in evaluating the absolute performance of inconsistency detection methods that perform a binary decision w.r.t each input. Deutsch et al (2022) also recently discuss various issues in measuring system-level correlations to assess the validity of automatic evaluation metrics for summarization.…”
Section: Meta-evaluation (mentioning)
confidence: 99%