Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1204
TalkSumm: A Dataset and Scalable Annotation Method for Scientific Paper Summarization Based on Conference Talks

Abstract: Currently, no large-scale training data is available for the task of scientific paper summarization. In this paper, we propose a novel method that automatically generates summaries for scientific papers, by utilizing videos of talks at scientific conferences. We hypothesize that such talks constitute a coherent and concise description of the papers' content, and can form the basis for good summaries. We collected 1716 papers and their corresponding videos, and created a dataset of paper summaries. A model trai…

Cited by 38 publications (28 citation statements)
References 16 publications
“…Existing datasets include CSPubSum for extractive summarization (Collins et al., 2017), ArXiv and PubMed for abstract generation (Cohan et al., 2018), and the SciSummNet and CL-SciSumm datasets (Jaidka et al., 2018; Chandrasekaran et al., 2019), which incorporate citation contexts into human-written summaries. TalkSumm (Lev et al., 2019) uses recordings of conference talks to create a distantly-supervised training set for the CL-SciSumm task.…”
Section: TLDR-PR
confidence: 99%
“…We use the OpenReview API 5 to collect pairs of papers and author-written TLDRs, along with the (Jaidka et al., 2018; Chandrasekaran et al., 2019), which has an additional 40 manually annotated documents and its statistics are similar to SciSummNet. ‡ Unlike the other summarization datasets presented here, TalkSumm is an automatically-constructed dataset for training; the TalkSumm-supervised model in Lev et al. (2019) was evaluated using CL-SciSumm (Jaidka et al., 2018).…”
confidence: 99%
“…The abstractive summarization data are from published papers and blogs which contain around 700 articles with an average of 31.7 sentences per summary and an average of 21.6 words per sentence. The extractive data are from Lev et al. (2019), which have 1705 paper-summary pairs. For each paper, it provides a summary with 30 sentences and 990 words on average.…”
Section: Data Preprocessing
confidence: 99%
“…The training corpus for this task includes 1705 extractive summaries, and 531 abstractive summaries of NLP/ML scientific papers. The extractive summaries are based on video talks from associated conferences (Lev et al., 2019), while the abstractive summaries are from blog posts created by NLP and ML researchers. The test set consists of 22 research papers for both extractive and abstractive summarization, and the task is to generate a summary of 600 words.…”
Section: Dataset Description
confidence: 99%