Proceedings of the Workshop on New Frontiers in Summarization 2017
DOI: 10.18653/v1/w17-4508

TL;DR: Mining Reddit to Learn Automatic Summarization

Abstract: Recent advances in automatic text summarization have used deep neural networks to generate high-quality abstractive summaries, but the performance of these models strongly depends on large amounts of suitable training data. We propose a new method for mining social media for author-provided summaries, taking advantage of the common practice of appending a "TL;DR" to long posts. A case study using a large Reddit crawl yields the Webis-TLDR-17 corpus, complementing existing corpora primarily from the news genre.
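The mining idea in the abstract lends itself to a compact implementation: scan each post for a "TL;DR" marker and split it into a (content, summary) pair. Below is a minimal sketch of that idea in Python; the regex, thresholds, and the mine_tldr_pair helper are illustrative assumptions, not the authors' actual pipeline.

```python
import re
from typing import Iterable, Iterator, Optional, Tuple

# Variants like "TL;DR", "tldr:", "tl dr -" appear in the wild; this pattern
# is an illustrative approximation, not the exact one used for Webis-TLDR-17.
TLDR_RE = re.compile(r"\btl\s*;?\s*dr\b\s*[:\-]?\s*", re.IGNORECASE)

def mine_tldr_pair(post: str, min_content: int = 100,
                   min_summary: int = 10) -> Optional[Tuple[str, str]]:
    """Split a post at its last TL;DR marker into (content, summary)."""
    matches = list(TLDR_RE.finditer(post))
    if not matches:
        return None
    # Use the last marker: author-provided summaries typically trail the post.
    last = matches[-1]
    content = post[:last.start()].strip()
    summary = post[last.end():].strip()
    # Length filters discard degenerate pairs (stub posts, empty summaries).
    if len(content) < min_content or len(summary) < min_summary:
        return None
    return content, summary

def mine_corpus(posts: Iterable[str]) -> Iterator[Tuple[str, str]]:
    for post in posts:
        pair = mine_tldr_pair(post)
        if pair:
            yield pair
```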

Cited by 75 publications (74 citation statements) | References 10 publications
“…For the former, we use the CNN/DailyMail news dataset (Hermann et al., 2015), widely used for the task of abstractive text summarization. For the latter, we use the Webis-TLDR-17 corpus (Völske et al., 2017), automatically created using TL;DR tags on Reddit. Figure 1 shows the distribution of lexical formality scores over these and the complete dataset (based on Equation 6).…”
Section: Methods
confidence: 99%
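Equation 6 is not reproduced in the excerpt. A widely used lexical formality measure of this kind is the F-score of Heylighen & Dewaele (1999), which rises with the share of nouns, adjectives, prepositions, and articles, and falls with pronouns, verbs, adverbs, and interjections; the sketch below computes it with NLTK under the unverified assumption that Equation 6 belongs to this family.

```python
import nltk  # requires nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")

# Penn Treebank tag prefixes mapped to the F-score's "formal" and "deictic"
# categories (Heylighen & Dewaele, 1999). DT approximates articles; this
# coarse mapping is an assumption of the sketch.
FORMAL = ("NN", "JJ", "IN", "DT")           # nouns, adjectives, prepositions, articles
DEICTIC = ("PRP", "VB", "RB", "UH", "WP")   # pronouns, verbs, adverbs, interjections

def formality_score(text: str) -> float:
    tokens = nltk.word_tokenize(text)
    tags = [tag for _, tag in nltk.pos_tag(tokens)]
    if not tags:
        return 50.0  # neutral score for empty input
    freq = lambda prefixes: 100.0 * sum(
        tag.startswith(prefixes) for tag in tags) / len(tags)
    # F = (formal% - deictic% + 100) / 2, ranging roughly from 0 to 100.
    return (freq(FORMAL) - freq(DEICTIC) + 100.0) / 2.0

print(formality_score("The committee approved the proposal."))   # higher
print(formality_score("Well, I really just think it's fine."))   # lower
```

News text such as CNN/DailyMail would be expected to score higher on such a measure than Reddit posts, which is consistent with the comparison the cited passage describes.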
“…To assess the possible benefits of reinforcing over the proposed QG-based metric, which does not require human-generated reference summaries, we employ TL;DR, a large-scale dataset for automatic summarization built on social media data, comprising 4 million training pairs (Völske et al., 2017). Both the CNN-DM and TL;DR datasets are in English.…”
Section: Data Used
confidence: 99%
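For readers who want to experiment with the corpus, it is mirrored on the Hugging Face Hub; the dataset identifier and the content/summary field names below reflect our understanding of that mirror and should be verified against its dataset card.

```python
from datasets import load_dataset

# Load the Webis-TLDR-17 mirror (~4M Reddit posts with TL;DR summaries).
# The id and field names are assumptions; newer versions of `datasets`
# may require trust_remote_code=True for script-based datasets.
ds = load_dataset("webis/tldr-17", split="train")

example = ds[0]
print(example["content"][:200])  # the long post body
print(example["summary"])        # the author-provided TL;DR
```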
“…TL;DR Reddit corpus (Völske et al., 2017): This is the dataset for the TL;DR challenge. They pro-… The Extractive Summarization and Abstractive Summarization modules are fine-tuned on each dataset to obtain the respective results.…”
Section: Datasets and Experimental Setup
confidence: 99%
“…TL;DR Reddit corpus (Völske et al., 2017): This is the dataset for the TL;DR challenge. They pro-…

Algorithm 2: Order Preserving Selection
1: A = list(<sentence, id, score>)
2: procedure REORDER(A)
3:   sortedA = sortByScore(A)

… et al. (2015), Gavrilov (2017), and the PGN implementation of See et al. (2017), Kumar (2019) are used as references for the abstractive module.…”
Section: Datasets and Experimental Setup
confidence: 99%
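The pseudocode fragment above is cut off mid-procedure. A plausible completion of order-preserving selection is to pick the top-k sentences by score but emit them in their original document order; the sketch below is our reconstruction under that assumption, not the cited paper's code.

```python
from typing import List, Tuple

# A scored sentence: (sentence text, position id in the document, model score).
Scored = Tuple[str, int, float]

def order_preserving_selection(A: List[Scored], k: int = 3) -> List[str]:
    """Select the k highest-scoring sentences, then restore document order.

    Reconstructed from the truncated "Algorithm 2" above; the top-k cutoff
    and tie-breaking behavior are assumptions of this sketch.
    """
    sorted_a = sorted(A, key=lambda s: s[2], reverse=True)  # sortByScore(A)
    selected = sorted_a[:k]
    selected.sort(key=lambda s: s[1])  # reorder by original position id
    return [sentence for sentence, _, _ in selected]

sentences = [("Intro sentence.", 0, 0.2), ("Key finding.", 1, 0.9),
             ("Detail.", 2, 0.7), ("Aside.", 3, 0.1), ("Conclusion.", 4, 0.8)]
print(order_preserving_selection(sentences, k=3))
# ['Key finding.', 'Detail.', 'Conclusion.'] (document order preserved)
```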