Proceedings of the Workshop on Methods for Optimizing and Evaluating Neural Language Generation 2019
DOI: 10.18653/v1/w19-2303
How to Compare Summarizers without Target Length? Pitfalls, Solutions and Re-Examination of the Neural Summarization Literature

Abstract: Until recently, summarization evaluations compared systems that produce summaries of the same target length. Neural approaches to summarization, however, have done away with length requirements. Here we present detailed experiments demonstrating that summaries of different lengths produced by the same system show a clear non-linear pattern of quality as measured by ROUGE F1 scores: quality initially improves steeply with summary length, then begins to decline gradually. Neural models produce summaries of different len…

Cited by 35 publications (36 citation statements) · References 22 publications
“…The difference between the maximum performance at n ≈ 18 and the widely adopted baseline (Lead-N-8) is large: 4.2 ROUGE-1 F1 points. A similar effect is observed by Sun et al (2019) for document summarization. This shows that ROUGE F1 is still sensitive to summary length, and this effect should be considered during evaluation.…”
Section: Summary Length (supporting)
confidence: 79%
“…We find that the summaries generated by ExtAbsRL include more tokens (94.5) than those generated by Refresh (83.4) and NeuralTD (85.6). Sun et al (2019) recently show that, for summaries whose lengths are in the range of 50 to 110 tokens, longer summaries receive higher ROUGE-F1 scores. We believe this is the reason why ExtAbsRL has higher ROUGE scores.…”
Section: Extractive Summarisation (mentioning)
confidence: 99%
“…Moreover, there is no clear optimal variant of ROUGE, and the exact choice can have a large impact on how a (neural) summarizer behaves when it is used as a training objective (Peyrard, 2019b). Sun et al (2019) demonstrate another shortfall of ROUGE-based evaluation: since the metric does not adjust for summary length, a comparison between systems can be misleading if one of them is inherently worse at the task, but better tuned to the summary length that increases ROUGE.…”
Section: ROUGE (mentioning)
confidence: 99%