Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2018
DOI: 10.18653/v1/n18-1065

Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies

Abstract: We present NEWSROOM, a summarization dataset of 1.3 million articles and summaries written by authors and editors in the newsrooms of 38 major news publications. Extracted from search and social media metadata between 1998 and 2017, these high-quality summaries demonstrate high diversity of summarization styles. In particular, the summaries combine abstractive and extractive strategies, borrowing words and phrases from articles at varying rates. We analyze the extraction strategies used in NEWSROOM summaries against…

Cited by 402 publications (503 citation statements); references 14 publications.
“…CNN-DM and AnchorContext generate fluent texts with low perplexities. While CNN-DM is trained on well-written news articles with extractive summaries [13], the high performance (low perplexity) of the AnchorContext model can be attributed to its relatively large corpus of 10 million training examples. However, in the case of AnchorContext-QB, the addition of query bias to the snippet generation process introduces breaks in the text flow, and thereby a small increase in perplexity.…”
Section: Intrinsic Evaluation
confidence: 99%
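
The fluency claims in the statement above rest on perplexity: the exponentiated average negative log-likelihood a language model assigns to a text, where lower means more fluent. A minimal sketch of that computation, assuming a pretrained GPT-2 from the Hugging Face transformers library as the scoring model (an illustrative stand-in, not the cited work's evaluator):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Illustrative scoring model; the cited work uses its own language models.
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity = exp(mean token-level negative log-likelihood)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids, the model returns the mean
        # cross-entropy loss over the predicted tokens.
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

print(perplexity("The quick brown fox jumps over the lazy dog."))
```

Text with abrupt breaks in flow, like the query-biased snippets described above, receives lower token probabilities and hence higher perplexity.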
“…It has multiple sentences (4.0 on average) as a summary. • Newsroom (Grusky et al., 2018): contains 1.3M news articles with summaries written by authors and editors from 1998 to 2017. It has both extractive and abstractive summaries.…”
Section: Summarization Corpora
confidence: 99%
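
The mix of extractive and abstractive summaries noted in the statement above is what Grusky et al. (2018) quantify with extractive fragment coverage (the fraction of summary words copied from the article) and density (which weights each copied fragment by its length, rewarding long verbatim spans). A simplified sketch of both measures under plain whitespace tokenization; the paper's fragment-matching procedure is more careful, so treat this as illustrative:

```python
def extractive_fragments(article_toks, summary_toks):
    """Greedily match each summary position to the longest article
    span starting there (simplified from Grusky et al., 2018)."""
    fragments, i = [], 0
    while i < len(summary_toks):
        best = 0
        for j in range(len(article_toks)):
            k = 0
            while (i + k < len(summary_toks)
                   and j + k < len(article_toks)
                   and summary_toks[i + k] == article_toks[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(best)
            i += best
        else:
            i += 1  # summary word not found in article
    return fragments

def coverage_and_density(article, summary):
    a, s = article.lower().split(), summary.lower().split()
    frags = extractive_fragments(a, s)
    coverage = sum(frags) / len(s)                # fraction of copied words
    density = sum(f * f for f in frags) / len(s)  # long copied spans weigh more
    return coverage, density

cov, den = coverage_and_density(
    "the cat sat on the mat while the dog slept",
    "the cat sat on the mat quietly",
)
print(f"coverage={cov:.2f}, density={den:.2f}")
```

Highly extractive summaries score near 1.0 coverage with high density; abstractive summaries score low on both, which is how the dataset's diversity of strategies is measured.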
“…The summarization objective is to select a handful of sentences to maximize the coverage of important content while minimizing summary redundancy. Although unsupervised methods are promising, they cannot benefit from the large-scale training data harvested from the Web (Sandhaus, 2008; Hermann et al., 2015; Grusky et al., 2018). Neural extractive summarization has focused primarily on extracting sentences (Nallapati et al., 2017; Cao et al., 2017; Isonuma et al., 2017; Tarnpradab et al., 2017; Zhou et al., 2018; Kedzie et al., 2018).…”
Section: Related Work
confidence: 99%
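
The objective stated in the quote above, maximizing coverage of important content while minimizing redundancy, is commonly approximated with a greedy, MMR-style selection loop over sentences. A minimal sketch under that assumption; the TF-IDF relevance scoring, the scikit-learn vectorizer, and the lambda_ trade-off are illustrative choices, not the cited papers' methods:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def greedy_extract(sentences, k=3, lambda_=0.7):
    """Greedy MMR-style selection: relevance to the whole document,
    minus redundancy against the summary built so far."""
    vec = TfidfVectorizer().fit(sentences)
    S = vec.transform(sentences)
    doc = vec.transform([" ".join(sentences)])
    relevance = cosine_similarity(S, doc).ravel()

    selected = []
    while len(selected) < min(k, len(sentences)):
        best_i, best_score = None, float("-inf")
        for i in range(len(sentences)):
            if i in selected:
                continue
            # Redundancy = similarity to the closest already-picked sentence.
            redundancy = max(
                (cosine_similarity(S[i], S[j])[0, 0] for j in selected),
                default=0.0,
            )
            score = lambda_ * relevance[i] - (1 - lambda_) * redundancy
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
    return [sentences[i] for i in sorted(selected)]  # keep document order
```

The supervised neural extractors cited above replace the hand-built relevance score with a learned one, which is exactly why large-scale training data such as NEWSROOM matters to them.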