2020
DOI: 10.48550/arxiv.2010.07100
Preprint

Re-evaluating Evaluation in Text Summarization

Abstract: Automated evaluation metrics as a stand-in for manual evaluation are an essential part of the development of text-generation tasks such as text summarization. However, while the field has progressed, our standard metrics have not: for nearly 20 years ROUGE has been the standard evaluation in most summarization papers. In this paper, we make an attempt to re-evaluate the evaluation method for text summarization: assessing the reliability of automatic metrics using top-scoring system outputs, both abstractive an…
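The paper's central exercise is meta-evaluation: checking how well an automatic metric's scores track human judgments of system outputs. As a rough illustration of the general idea only (not the paper's exact protocol or data), the sketch below scores a few hypothetical system outputs with ROUGE-1 via the rouge-score package and correlates the metric's system-level ranking with placeholder human scores using Kendall's tau; all systems, outputs, and scores here are made-up assumptions.

```python
# Minimal sketch of system-level meta-evaluation: correlate an automatic
# metric (ROUGE-1 F1) with human judgments. All data below are placeholders,
# not from the paper; requires `pip install rouge-score scipy`.
from rouge_score import rouge_scorer
from scipy.stats import kendalltau

# Hypothetical outputs of three summarization systems on two documents,
# plus reference summaries and made-up human quality scores per system.
references = ["the cat sat on the mat", "stocks fell sharply on friday"]
system_outputs = {
    "system_a": ["a cat sat on a mat", "stocks dropped sharply friday"],
    "system_b": ["the mat had a cat", "markets were quiet"],
    "system_c": ["dogs bark loudly", "it rained on friday"],
}
human_scores = {"system_a": 4.5, "system_b": 3.0, "system_c": 1.5}

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)

def system_rouge(outputs):
    # Average ROUGE-1 F1 over all documents for one system.
    f1s = [scorer.score(ref, out)["rouge1"].fmeasure
           for ref, out in zip(references, outputs)]
    return sum(f1s) / len(f1s)

systems = list(system_outputs)
metric = [system_rouge(system_outputs[s]) for s in systems]
human = [human_scores[s] for s in systems]

# System-level correlation: does ranking systems by ROUGE agree with humans?
tau, _ = kendalltau(metric, human)
print({s: round(m, 3) for s, m in zip(systems, metric)}, "Kendall tau:", tau)
```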

Cited by 5 publications (10 citation statements) | References 12 publications
“…Further, it is possible that, when the retrieved subset extracted by the content selection mechanism is not ordered in its original form, the incoherence of the subset cascades down to the final summary output, causing a drop in semantic coherence. This finding also illustrates the importance of measuring model performance in a multi-dimensional way rather than relying entirely on the ROUGE score, which has been found to have important limitations [3,61].…”
Section: 2.2 (mentioning)
confidence: 84%
“…While both architectures were initially proposed and tested on short documents, they can be effectively adapted to summarize long documents after incorporating novel mechanisms. 3 Finding 1. Graph-based Extractive Models with Discourse Bias: Classical graph-based unsupervised extractive models have been found to suffer from picking similar sentences, which results in a summary with redundant sentences [69].…”
Section: Supervised Hybrid (mentioning)
confidence: 98%
“…We focus on five different tasks: summary evaluation, image description, dialogue and translation. For summary evaluation, we use TAC08 (Dang et al, 2008), TAC10, TAC11 (Owczarzak & Dang, 2011), RSUM (Bhandari et al, 2020) and SEVAL (Fabbri et al, 2021). For sentence-based image description we rely on FLICKR (Young et al, 2014) and for dialogue we use PersonaChat (PC) and TopicalChat (TC) (Mehri & Eskenazi, 2020).…”
Section: Datasets With Instance-level Information (mentioning)
confidence: 99%