Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.647

MLSUM: The Multilingual Summarization Corpus

Abstract: We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five languages: French, German, Spanish, Russian, and Turkish. Together with English news articles from the popular CNN/Daily Mail dataset, the collected data form a large-scale multilingual dataset which can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art…
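For readers who want to inspect the corpus directly, here is a minimal sketch of loading one MLSUM language with the Hugging Face datasets library. The corpus is distributed on the Hub under the "mlsum" identifier with one config per language; the field names below follow that release, and recent datasets versions may additionally require trust_remote_code=True.

```python
# Minimal sketch: loading the Turkish portion of MLSUM via the
# Hugging Face "datasets" library. Language configs are "de", "es",
# "fr", "ru", and "tu"; each record carries "text", "summary",
# "title", "topic", "url", and "date" fields.
from datasets import load_dataset

mlsum_tr = load_dataset("mlsum", "tu")  # splits: train / validation / test

example = mlsum_tr["train"][0]
print(example["title"])       # article headline
print(example["summary"])     # reference summary
print(example["text"][:200])  # opening of the full article
```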

Cited by 98 publications (83 citation statements)
References 37 publications
“…MLSUM is the first large-scale MultiLingual SUMmarization dataset, which contains 1.5M+ article/summary pairs, including Turkish (Scialom et al., 2020). The authors compiled the dataset following the same methodology as the CNN/DailyMail dataset.…”
Section: Dataset (mentioning)
confidence: 99%
“…The data was split into train, validation, and test sets with respect to the publication dates. The data from 2010 to 2018 was used for training; data from January to April 2019 was used for validation; and the remaining data, up to December 2019, was used for testing (Scialom et al., 2020). In this study, we obtained the Turkish dataset from the HuggingFace collection.…”
Section: Dataset (mentioning)
confidence: 99%
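As a quick illustration of that chronological split, here is a hedged sketch (the function name is mine, not from the paper) that maps a publication date to its MLSUM split, reading "up to December 2019" as May through December 2019:

```python
from datetime import date

def mlsum_split(pub_date: date) -> str:
    # Chronological split described above: 2010-2018 for training,
    # January-April 2019 for validation, May-December 2019 for test.
    if 2010 <= pub_date.year <= 2018:
        return "train"
    if pub_date.year == 2019 and pub_date.month <= 4:
        return "validation"
    if pub_date.year == 2019:
        return "test"
    raise ValueError(f"{pub_date} falls outside the corpus date range")

assert mlsum_split(date(2015, 7, 1)) == "train"
assert mlsum_split(date(2019, 3, 15)) == "validation"
assert mlsum_split(date(2019, 11, 2)) == "test"
```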
“…With the large success brought by pre-trained language models in English abstractive summarization (Liu and Lapata, 2019; Lewis et al., 2020b; Zhang et al., 2020), several works focus on summarization in multiple languages. Nguyen and Daumé III (2019) construct a small cross-lingual dataset with English summaries for non-English articles, and Scialom et al. (2020) propose MLSUM with 5 languages as the extended version of the English summarization dataset CNN/DailyMail (Hermann et al., 2015). Cao et al. (2020) use a Transformer-based model with a 6-layer encoder and decoder to combine auto-encoder training, translation, and summarization.…”
Section: Related Work (mentioning)
confidence: 99%
“…It lacks the ability to align sentence-level information among languages and to distinguish which information is the most critical for the document-level input. Most previous multilingual summarization models focus on training one model for each language or partly sharing encoder/decoder layers (Wang et al., 2018; Lin et al., 2018; Scialom et al., 2020). Cao et al. (2020) and Lewis et al. (2020a) try to train one model for all languages, but they find that although low-resource languages can benefit from the larger training data, the performance of rich-resource languages has been sacrificed.…”
Section: Introduction (mentioning)
confidence: 99%
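To make the "one model for all languages" setting concrete, here is a rough sketch under stated assumptions: it uses mT5 and the Hugging Face transformers/datasets libraries as illustrative stand-ins, not the architectures of the works cited above.

```python
# Hedged sketch: preprocessing a mixture of all five MLSUM languages
# to fine-tune a single multilingual seq2seq model (mT5 is an
# illustrative choice, not the setup of the cited papers).
from datasets import interleave_datasets, load_dataset
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

languages = ["de", "es", "fr", "ru", "tu"]
mixed_train = interleave_datasets(
    [load_dataset("mlsum", lang, split="train") for lang in languages]
)

def tokenize(batch):
    # Truncate long articles; summaries become decoder labels.
    features = tokenizer(batch["text"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=64, truncation=True)
    features["labels"] = labels["input_ids"]
    return features

mixed_train = mixed_train.map(
    tokenize, batched=True, remove_columns=mixed_train.column_names
)
# mixed_train can now be passed to a Seq2SeqTrainer together with `model`.
```

Interleaving rather than concatenating keeps the language mixture balanced batch-to-batch, which matters for the rich-resource/low-resource trade-off discussed in the quote above.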
“…We also use MT learning in generation tasks, but the tasks are extractive; i.e., the output often has significant overlap with the input. These tasks include news title generation, text summarization, and question generation (Chi et al., 2020; Liang et al., 2020; Scialom et al., 2020). Reply suggestion is more challenging because the reply often does not overlap with the message (Figure 1), so the model needs to address different cross-lingual generalization challenges (Section 5.2).…”
(mentioning)
confidence: 99%