Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (2021)
DOI: 10.18653/v1/2021.acl-short.28
WikiSum: Coherent Summarization Dataset for Efficient Human-Evaluation

Abstract: Recent works have made significant advances on summarization tasks, facilitated by summarization datasets. Several existing datasets have the form of coherent-paragraph summaries. However, these datasets were curated from academic documents written for experts, making the essential step of assessing the summarization output through human evaluation very demanding. To overcome these limitations, we present a dataset based on article summaries appearing on the WikiHow website, composed of how-to articles and coh…

Cited by 7 publications (4 citation statements) · References 13 publications
“…In addition to the Amazon Reviews dataset, we experimented on the WikiSum dataset (Cohen et al., 2021) to further validate our findings. The WikiSum dataset is a coherent-paragraph summarization dataset based on the WikiHow website.…”
Section: Methods
Confidence: 91%
“…For example, SQuAD (Rajpurkar et al., 2016), a widely used question-answering dataset composed of Wikipedia articles from multiple domains, is often referred to as a single-domain dataset in domain adaptation works for simplicity (Hazen et al., 2019; Shakeri et al., 2020; Yue et al., 2021). This scenario is also common in text summarization, where many datasets consist of a bundle of domains for news articles (Grusky et al., 2018), academic papers (Cohan et al., 2018; Fonseca et al., 2022), and do-it-yourself (DIY) guides (Cohen et al., 2021). While models that learn from multiple domains are frequently used, their nature and behavior have hardly been explored.…”
Section: Introduction
Confidence: 99%
“…Here, we show how to leverage the proposed QG framework to improve closed-book QA tasks on seen data (WikiCQA) and unseen data (GooAQ and ELI5). Since freely available summary data is a good resource to generate synthetic data (Lyu et al., 2021), we use WikiSum (Cohen et al., 2021), which contains 39,775 coherent-paragraph summaries written by the article's authors on the WikiHow website. We take each sentence from the article summary as an answer and pass it into the best QG model, described in Section 3.1, to generate a question.…”
Section: Synthetic Data Generation (RQ3)
Confidence: 99%
“…Summaries are coherent paragraphs, written as tips by the document authors in a friendly manner. Therefore, the content is highly readable and easily comprehensible for readers [26].…”
Section: Other Summarization Datasets
Confidence: 99%