Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
DOI: 10.18653/v1/2021.emnlp-main.797
MassiveSumm: a very large-scale, very multilingual, news summarisation dataset

Cited by 23 publications (20 citation statements)
References 15 publications
“…Conventional cross-lingual summarization methods mainly focus on incorporating bilingual information into pipeline methods (Leuski et al., 2003; Ouyang et al., 2019; Orăsan and Chiorean, 2008; Wan et al., 2010; Wan, 2011; Yao et al., 2015; Zhang et al., 2016b), i.e., translation and then summarization, or summarization and then translation. Due to the difficulty of acquiring cross-lingual summarization datasets, some previous studies focus on constructing datasets (Ladhak et al., 2020; Scialom et al., 2020; Yela-Bello et al., 2021; Zhu et al., 2019; Hasan et al., 2021; Perez-Beltrachini and Lapata, 2021; Varab and Schluter, 2021), mixed-lingual pre-training (Xu et al., 2020), knowledge distillation (Nguyen and Tuan, 2021), contrastive learning (Wang et al., 2021), or zero-shot approaches (Ayana et al., 2018; Duan et al., 2019; Dou et al., 2020), i.e., using machine translation (MT), monolingual summarization (MS), or both to train the CLS system. Among them, Zhu et al. (2019) propose a round-trip translation strategy to obtain large-scale CLS datasets and then present two multi-task learning methods for CLS.…”
Section: Related Work (mentioning)
confidence: 99%
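
The two pipeline orderings named in the excerpt above (translate-then-summarize and summarize-then-translate) can be made concrete with a minimal Python sketch. This is illustrative only: translate() and summarize() are hypothetical placeholders for real MT and monolingual summarization systems, not components from any of the cited works.

# Minimal sketch of the two cross-lingual summarization (CLS)
# pipeline orderings. All function names are hypothetical stand-ins.

def translate(text: str, src: str, tgt: str) -> str:
    """Placeholder MT system: tags the text with the language pair."""
    return f"[{src}->{tgt}] {text}"

def summarize(text: str) -> str:
    """Placeholder monolingual summarizer: keeps the first sentence."""
    return text.split(". ")[0]

def translate_then_summarize(doc: str, src: str, tgt: str) -> str:
    # Ordering 1: translate the full document into the target language,
    # then summarize it with a target-language summarizer.
    return summarize(translate(doc, src, tgt))

def summarize_then_translate(doc: str, src: str, tgt: str) -> str:
    # Ordering 2: summarize in the source language, then translate only
    # the much shorter summary, so MT errors in the document body never
    # reach the output.
    return translate(summarize(doc), src, tgt)

if __name__ == "__main__":
    doc = "A source-language document. Further details follow here."
    print(translate_then_summarize(doc, "da", "en"))
    print(summarize_then_translate(doc, "da", "en"))

Either ordering compounds the errors of its two components, which is one reason the excerpt's later approaches (joint datasets, mixed-lingual pre-training, zero-shot transfer) train end-to-end CLS systems instead.
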
“…For instance, Wikipedia has been used as a resource to derive multilingual benchmarks (Botha et al., 2020; Liu et al., 2019a; Pan et al., 2017; Rahimi et al., 2019), and several multilingual summarisation datasets have been created by extracting article-summary pairs from online newspapers or how-to guides (e.g. Hasan et al., 2021; Ladhak et al., 2020; Nguyen and Daumé III, 2019; Scialom et al., 2020; Varab and Schluter, 2021). Various linguistic resources have also been exploited: for instance, the Universal Dependencies treebank (Nivre et al., 2020) has been used to evaluate cross-lingual part-of-speech tagging, and multilingual WordNet and Wiktionary have been used to build XL-WiC (Raganato et al., 2020), an extension of WiC (Pilehvar and Camacho-Collados, 2019) that reformulates word sense disambiguation in 12 languages as a binary classification task.…”
Section: Generalisation Across Languages (mentioning)
confidence: 99%
“…Even though it is a large high-quality resource of parallel data for cross-lingual summarization, this corpus is built from how-to guides: our dataset focuses instead on scholarly documents. Besides cross-lingual corpora, there are also large-scale multilingual summarization datasets for the news domain [48,52]. The work we present here differs in that we focus on extreme summarization for the scholarly domain and we look specifically at the problem of cross-lingual summarization in which source and target language differ.…”
Section: Related Work (mentioning)
confidence: 99%
“…Just like in virtually all areas of NLP research, most successful approaches to summarization rely on neural techniques using supervision from labeled data. This includes neural models to summarize documents in general domains such as news articles [33,49], including cross- and multi-lingual models and datasets [48,52], as well as specialized ones, e.g., for the biomedical domain [39].…”
Section: Introduction (mentioning)
confidence: 99%