Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.428

TLDR: Extreme Summarization of Scientific Documents

Abstract: We introduce TLDR generation, a new form of extreme summarization, for scientific papers. TLDR generation involves high source compression and requires expert background knowledge and understanding of complex domain-specific language. To facilitate study on this task, we introduce SCITLDR, a new multi-target dataset of 5.4K TLDRs over 3.2K papers. SCITLDR contains both author-written and expert-derived TLDRs, where the latter are collected using a novel annotation protocol that produces high-quality summaries …
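The SCITLDR dataset described above is publicly released. As a rough aid to readers, here is a minimal Python sketch for inspecting it, assuming the Hugging Face Hub copy under the ID allenai/scitldr with its Abstract configuration; the "source" and "target" field names are assumptions based on that public release, not details stated in this abstract.

# Minimal inspection sketch; dataset ID, configuration name, and field
# names ("source", "target") are assumptions based on the public release.
from datasets import load_dataset

scitldr = load_dataset("allenai/scitldr", "Abstract")  # train / validation / test splits

example = scitldr["train"][0]
print(" ".join(example["source"]))  # the abstract sentences of one paper
print(example["target"])            # one or more gold TLDRs for the same paper

The multi-target property mentioned in the abstract shows up here as a list of reference TLDRs per paper, which is why evaluation on this benchmark is typically done against multiple references.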

Cited by 118 publications (121 citation statements). References 38 publications.
“…In this work, we explore the benefits of intermediate pretraining using existing summarization datasets for a target task involving the summarization of scientific articles. We obtain improvements in performance over state-of-the-art extractive summarization baseline systems on a new scientific summarization benchmark, SCITLDR (Cachola et al., 2020). We also make the following key observations:…”
Section: Introduction (mentioning)
confidence: 69%
“…We evaluate the models on two scientific summarization benchmark datasets: PubMed (Cohan et al., 2018) and SCITLDR (Cachola et al., 2020). We use the CNN/DM (Hermann et al., 2015) dataset for intermediate pretraining.…”
Section: Summarization Datasets (mentioning)
confidence: 99%
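As a rough illustration of the two-stage recipe quoted above (intermediate fine-tuning on CNN/DM, then fine-tuning on the scientific target benchmark), here is a minimal Python sketch. The BART base model, the dataset IDs, and the fine_tune placeholder are illustrative assumptions, not the cited authors' exact setup.

from datasets import load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Illustrative base model; the cited work may use a different architecture.
tokenizer = AutoTokenizer.from_pretrained("facebook/bart-base")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-base")

def fine_tune(model, documents, summaries):
    """Placeholder for a standard seq2seq training loop (e.g. Seq2SeqTrainer)."""
    raise NotImplementedError

# Stage 1: intermediate pretraining on a large news summarization corpus.
cnn_dm = load_dataset("cnn_dailymail", "3.0.0", split="train")
fine_tune(model, cnn_dm["article"], cnn_dm["highlights"])

# Stage 2: continue training the same weights on the scientific benchmark.
scitldr = load_dataset("allenai/scitldr", "Abstract", split="train")
fine_tune(model,
          [" ".join(sentences) for sentences in scitldr["source"]],
          [targets[0] for targets in scitldr["target"]])

The point of the recipe is simply that the weights fine-tuned in stage 1 are reused, not reinitialized, before stage 2.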
“…We consider the abstractive document-to-slide generation task as a query-based single-document text summarization (QSS) task. Although there has been increasing interest in constructing large-scale single-document text summarization corpora (CNN/DM (Hermann et al., 2015; Nallapati et al., 2016), Newsroom (Grusky et al., 2018), XSum (Narayan et al., 2018), TLDR (Cachola et al., 2020)) and in developing various approaches to address this task (Pointer-Generator (See et al., 2017), Bottom-Up (Gehrmann et al., 2018), BERTSum (Liu and Lapata, 2019)), QSS remains a relatively unexplored field. Most studies on query-based text summarization focus on the multi-document level (Dang, 2005; Baumel et al., 2016) and use extractive approaches (Feigenblat et al., 2017; Xu and Lapata, 2020).…”
Section: Related Work (mentioning)
confidence: 99%
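One common way to cast query-based single-document summarization (QSS) as ordinary sequence-to-sequence generation is to prepend the query (for example, a slide title) to the document before summarizing. The Python sketch below is a hedged illustration of that idea, not the cited papers' method; the model name and separator token are assumptions.

from transformers import pipeline

# Generic abstractive summarizer used purely for illustration.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def query_summarize(query, document, max_length=64):
    # Concatenate query and document so generation is conditioned on both.
    conditioned_input = query + " </s> " + document
    result = summarizer(conditioned_input, max_length=max_length, truncation=True)
    return result[0]["summary_text"]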
“…Although summarization of scientific texts is less explored, Cohan et al. (2018) proposed a hierarchical encoder-decoder network to handle long scholarly documents when constructing abstractive summaries, while other work suggested summarization using abstract and citation sentences with graph convolutional networks (Kipf and Welling, 2016) and LSTMs, and released a medium-scale dataset containing 1,000 scientific papers in the computational linguistics domain with human-written summaries and citation sentences for each paper. Cachola et al. (2020) implemented an extreme summarization system, i.e., TLDR (Too Long; Didn't Read) summarization, for scientific documents using multi-task learning with headline generation models (Vasilyev et al., 2019). Zhang et al. (2019b) proposed PEGASUS, which masks important sentences in the input document and trains a Transformer-based encoder-decoder network to generate them, forcing the model to summarize the main points of the content given the remainder of the text.…”
Section: Related Work (mentioning)
confidence: 99%
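The PEGASUS objective mentioned above (mask important sentences, then train the model to generate them from the remaining text) can be sketched in Python as follows. The word-overlap importance score is a simplifying assumption standing in for the ROUGE-based gap-sentence selection of the original paper.

def gap_sentence_example(sentences, mask_ratio=0.3, mask_token="<mask_1>"):
    # Score each sentence by vocabulary overlap with the rest of the document,
    # a crude proxy for the ROUGE-based importance used by PEGASUS.
    def overlap(i):
        words = set(sentences[i].lower().split())
        rest = set(w for j, s in enumerate(sentences) if j != i
                   for w in s.lower().split())
        return len(words & rest) / max(len(words), 1)

    k = max(1, int(len(sentences) * mask_ratio))
    masked = set(sorted(range(len(sentences)), key=overlap, reverse=True)[:k])

    # The masked document is the encoder input; the removed sentences are the target.
    source = " ".join(mask_token if i in masked else s
                      for i, s in enumerate(sentences))
    target = " ".join(sentences[i] for i in sorted(masked))
    return source, target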
“…Developing human-written lay summaries for scholarly documents is challenging, since it requires expert knowledge to understand the technical jargon and the complex structure of scientific documents. Because of these inherent challenges, existing summarization techniques for scientific documents are limited: the produced summary is either too concise to convey important information (Vasilyev et al., 2019; Cachola et al., 2020) or extracted directly from abstract or citation sentences, which mostly resembles the abstract, making it hard for the public and for researchers outside the particular domain to understand the main points of scientific papers. Although the readability of abstracts in scientific papers has continuously decreased due to the increased use of technical jargon (Plavén-Sigray et al., 2017), summarization of scientific papers for the public and for researchers outside a given field has remained elusive.…”
Section: Introduction (mentioning)
confidence: 99%