How to Compare Summarizers without Target Length? Pitfalls, Solutions and Re-Examination of the Neural Summarization Literature

Sun, Simeng; Shapira, Ori; Dagan, Ido; Nenkova, Ani

doi:10.18653/v1/w19-2303

Cited by 35 publications

(36 citation statements)

References 22 publications


“…The difference between the maximum performance at n ≈ 18 and the widely adopted baseline (Lead-N-8) is large: 4.2 ROUGE-1 F1 points. A similar effect is observed by Sun et al (2019) for document summarization. This shows that ROUGE F1 is still sensitive to summary length, and this effect should be considered during evaluation.…”

Section: Summary Length (supporting)

confidence: 79%

Discrete Optimization for Unsupervised Sentence Summarization with Word-Level Extraction

Schumann, Mou, Lu et al. 2020

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics


Automatic sentence summarization produces a shorter version of a sentence, while preserving its most important information. A good summary is characterized by language fluency and high information overlap with the source sentence. We model these two aspects in an unsupervised objective function, consisting of language modeling and semantic similarity metrics. We search for a high-scoring summary by discrete optimization. Our proposed method achieves a new state-of-the-art for unsupervised sentence summarization according to ROUGE scores. Additionally, we demonstrate that the commonly reported ROUGE F1 metric is sensitive to summary length. Since this is unwillingly exploited in recent work, we emphasize that future evaluation should explicitly group summarization systems by output length brackets.
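To illustrate the length-bracket recommendation in the abstract above, here is a minimal Python sketch. It is not the authors' code and not the official ROUGE toolkit: `rouge1_f1` is a simplified unigram-overlap F1 over whitespace tokens, and `length_bracket`, `group_by_length_bracket`, and the bracket width of 5 tokens are hypothetical choices made only for this example.

```python
from collections import Counter, defaultdict


def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap on lowercased
    whitespace tokens (assumption: no stemming or stopword handling)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Counter intersection clips each token's count to the smaller side.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def length_bracket(n_tokens, width=5):
    """Map a token count to an inclusive (low, high) bracket.
    The width is a hypothetical parameter, not taken from the paper."""
    low = (n_tokens // width) * width
    return (low, low + width - 1)


def group_by_length_bracket(outputs, width=5):
    """Group system names by the bracket of their average output length.
    `outputs` maps a system name to its list of summary strings."""
    groups = defaultdict(list)
    for name, summaries in outputs.items():
        avg_len = sum(len(s.split()) for s in summaries) / len(summaries)
        groups[length_bracket(round(avg_len), width)].append(name)
    return dict(groups)
```

Under this sketch, systems would be compared on `rouge1_f1` only with others that fall in the same bracket, so that a score difference is less likely to reflect output length alone.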


“…We find that the summaries generated by ExtAbsRL include more tokens (94.5) than those generated by Refresh (83.4) and NeuralTD (85.6). Sun et al (2019) recently show that, for summaries whose lengths are in the range of 50 to 110 tokens, longer summaries receive higher ROUGE-F1 scores. We believe this is the reason why ExtAbsRL has higher ROUGE scores.…”

Section: Extractive Summarisation (mentioning)

confidence: 99%

Better Rewards Yield Better Summaries: Learning to Summarise Without References

Böhm, Gao, Meyer et al. 2019

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conferen

Self Cite


Reinforcement Learning (RL) based document summarisation systems yield state-of-the-art performance in terms of ROUGE scores, because they directly use ROUGE as the rewards during training. However, summaries with high ROUGE scores often receive low human judgement. To find a better reward function that can guide RL to generate human-appealing summaries, we learn a reward function from human ratings on 2,500 summaries. Our reward function only takes the document and system summary as input. Hence, once trained, it can be used to train RL-based summarisation systems without using any reference summaries. We show that our learned rewards have significantly higher correlation with human ratings than previous approaches. Human evaluation experiments show that, compared to the state-of-the-art supervised-learning systems and ROUGE-as-rewards RL summarisation systems, the RL systems using our learned rewards during training generate summaries with higher human ratings. The learned reward function and our source code are available at https://github.com/yg211/summary-reward-no-reference.


“…Moreover, there is no clear optimal variant of ROUGE, and the exact choice can have a large impact on how a (neural) summarizer behaves when it is used as a training objective (Peyrard, 2019b). Sun et al (2019) demonstrate another shortfall of ROUGE-based evaluation: Since the metric does not adjust for summary length, a comparison between systems can be misleading if one of them is inherently worse at the task, but better tuned to the summary length that increases ROUGE.…”

Section: Rouge (mentioning)

confidence: 99%

Truth or Error? Towards systematic analysis of factual errors in abstractive summaries

Lux, Sappelli, Larson 2020

Proceedings of the First Workshop on Evaluation and Comparison of NLP Systems


This paper presents a typology of errors produced by automatic summarization systems. The typology was created by manually analyzing the output of four recent neural summarization systems. Our work is motivated by the growing awareness of the need for better summary evaluation methods that go beyond conventional overlap-based metrics. Our typology is structured into two dimensions. First, the Mapping Dimension describes surface-level errors and provides insight into word-sequence transformation issues. Second, the Meaning Dimension describes issues related to interpretation and provides insight into breakdowns in truth, i.e., factual faithfulness to the original text. Comparative analysis revealed that two neural summarization systems leveraging pretrained models have an advantage in decreasing grammaticality errors, but not necessarily factual errors. We also discuss the importance of ensuring that summary length and abstractiveness do not interfere with evaluating summary quality.

