In text summarization, evaluating the efficacy of automatic metrics without human judgments has recently become popular. One exemplar work (Peyrard, 2019) concludes that automatic metrics strongly disagree when ranking high-scoring summaries. In this paper, we revisit those experiments and find that the observed disagreement stems from the fact that metrics disagree when ranking summaries from any narrow scoring range. We hypothesize that this is because summaries within a narrow scoring range are similar to each other and therefore difficult to rank. Beyond the width of the scoring range, we analyze three other properties that affect inter-metric agreement: Ease of Summarization, Abstractiveness, and Coverage. To encourage reproducible research, we make all our analysis code and data publicly available.

1 Introduction

Automatic metrics play a significant role in summarization evaluation, profoundly affecting the direction of system optimization. Because of this importance, evaluating the quality of evaluation metrics, also known as meta-evaluation, has become a crucial step. Generally, there are two meta-evaluation strategies: (i) assessing how well each metric correlates with human judgments (
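To illustrate the kind of analysis the abstract describes, the following sketch computes inter-metric agreement restricted to a narrow scoring range. It is not the paper's released code: the function name `agreement_in_range`, the synthetic scores, and the choice of Kendall's tau as the agreement measure are assumptions for the example; any rank-correlation coefficient and any pair of real per-summary metric scores could be substituted.

```python
# Hypothetical sketch: inter-metric agreement within a narrow scoring range.
# scores_a and scores_b are assumed per-summary scores from two automatic
# metrics (e.g., two ROUGE variants) over the same set of candidate summaries.
import numpy as np
from scipy.stats import kendalltau

def agreement_in_range(scores_a, scores_b, low, high):
    """Kendall's tau between two metrics, restricted to summaries whose
    metric-A score falls inside [low, high]."""
    scores_a = np.asarray(scores_a)
    scores_b = np.asarray(scores_b)
    mask = (scores_a >= low) & (scores_a <= high)
    if mask.sum() < 2:
        return float("nan")  # too few summaries in the band to rank
    tau, _ = kendalltau(scores_a[mask], scores_b[mask])
    return tau

# Toy demonstration with synthetic scores: agreement is typically lower in a
# narrow band than over the full range, because the summaries in the band are
# close together relative to the metrics' noise.
rng = np.random.default_rng(0)
a = rng.uniform(0.0, 1.0, 500)
b = a + rng.normal(0.0, 0.05, 500)          # a correlated second "metric"
print(agreement_in_range(a, b, 0.0, 1.0))   # full range: high tau
print(agreement_in_range(a, b, 0.9, 1.0))   # narrow top band: lower tau
```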
Automated evaluation metrics, as a stand-in for manual evaluation, are an essential part of the development of text-generation tasks such as text summarization. However, while the field has progressed, our standard metrics have not: for nearly 20 years, ROUGE has been the standard evaluation in most summarization papers. In this paper, we re-evaluate the evaluation method for text summarization, assessing the reliability of automatic metrics using top-scoring system outputs, both abstractive and extractive, on recently popular datasets for both system-level and summary-level evaluation settings. We find that conclusions about evaluation metrics drawn on older datasets do not necessarily hold on modern datasets and systems. We release a dataset of human judgments collected over the outputs of 25 top-scoring neural summarization systems (14 abstractive and 11 extractive): https://github.com/neulab/REALSumm
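To make the two evaluation settings mentioned above concrete, here is a minimal Python sketch of how system-level and summary-level correlation with human judgments are commonly computed. It is not the released REALSumm code: the dictionary layout (`metric[system][doc]`, `human[system][doc]`) and the choice of Pearson and Kendall correlations are assumptions for illustration.

```python
# Minimal sketch of the two meta-evaluation settings, assuming metric[s][d]
# and human[s][d] hold an automatic-metric score and a human judgment for
# system s on document d (same systems and documents in both dicts).
import numpy as np
from scipy.stats import pearsonr, kendalltau

def system_level(metric, human):
    """Correlate per-system averages: one point per system."""
    systems = list(metric)
    m = [np.mean(list(metric[s].values())) for s in systems]
    h = [np.mean(list(human[s].values())) for s in systems]
    r, _ = pearsonr(m, h)
    return r

def summary_level(metric, human):
    """Correlate systems within each document, then average over documents."""
    systems = list(metric)
    docs = metric[systems[0]].keys()
    taus = []
    for d in docs:
        tau, _ = kendalltau([metric[s][d] for s in systems],
                            [human[s][d] for s in systems])
        taus.append(tau)
    return float(np.nanmean(taus))  # ignore documents where tau is undefined

# Toy usage with two hypothetical systems and two documents:
metric = {"sys1": {"d1": 0.42, "d2": 0.38}, "sys2": {"d1": 0.45, "d2": 0.37}}
human  = {"sys1": {"d1": 3.0,  "d2": 2.5},  "sys2": {"d1": 3.5,  "d2": 2.2}}
print(system_level(metric, human), summary_level(metric, human))
```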