Proceedings of the Third Workshop on New Frontiers in Summarization 2021
DOI: 10.18653/v1/2021.newsum-1.5

A Novel Wikipedia based Dataset for Monolingual and Cross-Lingual Summarization

Abstract: Cross-lingual summarization is a challenging task for which there are no cross-lingual scientific resources currently available. To overcome the lack of a high-quality resource, we present a new dataset for monolingual and cross-lingual summarization considering the English-German pair. We collect high-quality, real-world cross-lingual data from Spektrum der Wissenschaft, which publishes human-written German scientific summaries of English science articles on various subjects. The generated Spektrum dataset is …

Cited by 6 publications (6 citation statements)
References 19 publications
“…Cross-lingual scientific summarization is an understudied area due to its challenging nature. We find two studies: a synthetic dataset from English to Somali, Swahili, and Tagalog with round trip translation (Ouyang et al, 2019), two real cross-lingual datasets from Wikipedia Science Portal and Spektrum der Wissenschaft for English-German (Fatima and Strube, 2021).…”
Section: Scientific Summarization (mentioning)
confidence: 99%
“…These datasets are, unfortunately, not suitable for cross-lingual science journalism. Moreover, cross-lingual science journalism has been investigated as a fusion of cross-lingual summarization and text simplification with a pipeline model (Fatima and Strube, 2023) with cross-lingual scientific datasets (Fatima and Strube, 2021). In the dawn of cross-lingual summarization, various pipeline models (Ouyang et al, 2019; Zhu et al, 2019, 2020) with synthetic cross-lingual datasets have been introduced to explore the task.…”
Section: Introduction (mentioning)
confidence: 99%
“…The work closest to ours has been recently presented by Fatima and Strube [16], who introduce an English-German cross-lingual summarization dataset collected from German scientific magazines and Wikipedia. This resource is complementary to ours in many different aspects.…”
Section: Related Work (mentioning)
confidence: 99%
“…Our dataset consists of two main portions: a) a translated version of the original dataset from Cachola et al [5] in German, Italian and Chinese to enable comparability across languages on the basis of post-edited automatic translations; b) a dataset of human-generated TLDRs in Japanese from a community-based summarization platform to test performance on a second, comparable human-generated dataset. Our work complements seminal efforts from Fatima and Strube [16], who compile an English-German cross-lingual dataset from the Spektrum der Wissenschaft / Scientific American and Wikipedia, in that we focus on extreme summarization, build a dataset of expert-derived multilingual TLDRs (as opposed to leads from Wikipedia), and provide additional languages.…”
Section: Introduction (mentioning)
confidence: 99%