Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019
DOI: 10.18653/v1/d19-1051

Neural Text Summarization: A Critical Evaluation

Abstract: Text summarization aims at compressing long documents into a shorter form that conveys the most important parts of the original document. Despite increased interest in the community and notable research effort, progress on benchmark datasets has stagnated. We critically evaluate key ingredients of the current research setup: datasets, evaluation metrics, and models, and highlight three primary shortcomings: 1) automatically collected datasets leave the task underconstrained and may contain noise detrimental to…

Cited by 277 publications (299 citation statements)
References: 51 publications
Citation statements (ordered by relevance):
“…Most automatic evaluation protocols, e.g., ROUGE and BERTScore [154], are not sufficient to evaluate the overall quality of generated summaries [70,91]. We still have to assess some critical features of generated summaries, like factual correctness [70], fluency, and relevance [19], through human experts. Thus, a future research direction along this line is building better evaluation systems that go beyond current metrics to capture the most important features which agree with humans.…”
Section: Experiments on Newsroom and Bytecup Datasets (mentioning; confidence: 99%)
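
For readers who want to see the kind of automatic scoring the quoted statement refers to, here is a minimal sketch (an editorial addition, not code from the cited papers) assuming the rouge-score and bert-score Python packages are available; the example sentences are invented.

```python
# Minimal sketch: scoring one candidate summary against a reference with
# ROUGE and BERTScore. Assumes the `rouge-score` and `bert-score` packages
# are installed; the example texts are invented for illustration.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

reference = "The company reported a quarterly loss and announced layoffs."
candidate = "The company announced layoffs after reporting a quarterly loss."

# ROUGE: n-gram and longest-common-subsequence overlap with the reference.
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
rouge = scorer.score(reference, candidate)
print({name: round(s.fmeasure, 3) for name, s in rouge.items()})

# BERTScore: token-level similarity in contextual-embedding space.
P, R, F1 = bert_score([candidate], [reference], lang="en", verbose=False)
print("BERTScore F1:", round(F1.item(), 3))
```

Both metrics reward overlap with the reference; neither checks whether the candidate's claims are actually supported by the source document, which is the gap the statement says still requires human assessment.
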
“…the input text (Cao et al., 2018). Automatic metrics used to evaluate text generation, such as ROUGE and BERTScore (Zhang et al., 2020), are not correlated with the factual consistency or faithfulness of the generated text (Falke et al., 2019; Kryściński et al., 2019). To address this, recent work has studied the use of textual entailment models to rank and filter non-factual generations (Falke et al., 2019; Maynez et al., 2020).…”
Section: Generation (mentioning; confidence: 99%)
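
As a concrete, hedged illustration of the entailment-based ranking and filtering described in the statement above, the sketch below scores each summary sentence with an off-the-shelf NLI model and keeps only sentences the source is judged to entail. The model name, label ordering, threshold, and example texts are assumptions made for illustration; none of them are taken from Falke et al. (2019) or Maynez et al. (2020).

```python
# Minimal sketch of entailment-based filtering of summary sentences.
# Assumptions (not from the cited papers): the `roberta-large-mnli` checkpoint,
# its label order [contradiction, neutral, entailment], and a 0.5 threshold.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # assumed off-the-shelf NLI model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

ENTAILMENT_IDX = 2  # assumption: logits ordered [contradiction, neutral, entailment]

def entailment_prob(premise: str, hypothesis: str) -> float:
    """Probability that the premise entails the hypothesis under the NLI model."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, ENTAILMENT_IDX].item()

def filter_summary(source: str, summary_sentences: list, threshold: float = 0.5) -> list:
    """Keep only summary sentences that the source text is judged to entail."""
    return [s for s in summary_sentences if entailment_prob(source, s) >= threshold]

source_doc = ("The mayor opened the new bridge on Monday "
              "after two years of construction.")
candidates = [
    "The mayor opened the new bridge on Monday.",       # supported by the source
    "The bridge cost the city fifty million dollars.",  # unsupported (hallucinated)
]
print(filter_summary(source_doc, candidates))
```

The same entailment score can equally be used to rerank candidate summaries rather than to filter individual sentences, which is closer to the ranking setup the statement mentions.
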
“…The work also showed that ROUGE scores do not correlate with factual correctness, emphasizing that ROUGE-based evaluation alone is not enough for the summarization task. In addition, Kryscinski et al. (2019a) pointed out that current evaluation protocols correlate weakly with human judgements and do not take factual correctness into account. Maynez et al. (2020) conducted a large-scale human evaluation of the generated summaries of various abstractive summarization systems and found substantial amounts of hallucinated content in those summaries.…”
Section: Related Work (mentioning; confidence: 99%)
“…Abstractive summarization has attracted increasing attention recently, thanks to the availability of large-scale datasets (Sandhaus, 2008; Hermann et al., 2015; Grusky et al., 2018; Narayan et al., 2018a) and advances in neural architectures (Sutskever et al., 2014; Bahdanau et al., 2015a; Vinyals et al., 2015; Vaswani et al., 2017). Although modern abstractive summarization systems generate relatively fluent summaries, recent work has called attention to the problem they have with factual inconsistency (Kryscinski et al., 2019a). That is, they produce summaries that contain hallucinated facts that are not supported by the source text.…”
Section: Introduction (mentioning; confidence: 99%)