2022
DOI: 10.1162/tacl_a_00447

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

Abstract: With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences o…

Cited by 60 publications (49 citation statements)
References 39 publications
“…While the authors of mC4 still perform text deduplication, language detection and bad-words removal, they lower the language detection threshold to 70% and omit other useful heuristics due to the great variability across the covered character systems, such as filtering sentences having non-standard end-of-sentence punctuation. As a consequence, the resulting corpus has an overall lower quality, with a recent study finding 16% of examples in a random sample of mC4 being associated with the wrong language tag, and 11% of them not containing any linguistic information (Kreutzer et al., 2022).”
Section: Data and Model Pretraining
confidence: 99%
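The filtering heuristics described in this statement can be sketched in a few lines. This is a minimal illustration, not the mC4 pipeline itself: `detect_language` is a hypothetical stand-in for a real language-ID model (e.g. a fastText- or CLD3-style classifier), and only the 70% confidence threshold is taken from the quote above.

```python
# Sketch of mC4-style example filtering, assuming a hypothetical
# language identifier `detect_language(text) -> (lang, confidence)`.

def detect_language(text):
    """Toy stand-in for a real language-ID model (hypothetical)."""
    # Pretend any text containing letters is English with high confidence;
    # digit-only or symbol-only text gets an "undetermined" tag.
    if any(c.isalpha() for c in text):
        return ("en", 0.95)
    return ("und", 0.10)

# Terminal-punctuation heuristic that mC4 omits for non-Latin scripts,
# per the citation statement above.
SENTENCE_ENDINGS = (".", "!", "?", '"')

def keep_example(text, target_lang="en", threshold=0.70,
                 require_terminal_punct=False):
    """Return True if the example passes the sketched filters."""
    lang, conf = detect_language(text)
    if lang != target_lang or conf < threshold:  # 70% LangID threshold
        return False
    if require_terminal_punct and not text.rstrip().endswith(SENTENCE_ENDINGS):
        return False
    return True

print(keep_example("A clean English sentence."))  # True
print(keep_example("12345 67890"))                # False: no linguistic content
print(keep_example("no final punctuation",
                   require_terminal_punct=True))  # False
```

Lowering the confidence threshold and dropping the punctuation check, as mC4 does, trades precision for coverage across writing systems — which is exactly the quality gap the audited statistics (16% wrong language tag, 11% non-linguistic content) quantify.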
“…One of the main repositories of parallel data for MT is OPUS (Tiedemann, 2012), which includes many multilingual parallel datasets ranging in domains, languages and sizes. Catalan is included in many of these large web-crawled datasets; however, as Kreutzer et al. (2021) point out, most data coming from online sources is of poor quality. Hence the importance of high-quality data curation.”
Section: Related Work
confidence: 99%
“…Nonetheless, we are aware that the quality of the datasets varies greatly, since automatic alignment and manual revision yield very different results. CCAligned, for instance, has been shown to have poor quality (Kreutzer et al., 2021). 2021b) pipeline to process the WARC files obtained from the crawling.”
Section: Machine Translation
confidence: 99%
“…To deal with the demands of deep learning, data curators and researchers have turned to enormous internet-scraped datasets such as Common Crawl Corpus or WebText. As these unstructured corpora become larger, the risk of them containing harmful content increases, and the larger the dataset, the more difficult it is for humans to explore what is in the dataset and audit for quality or toxicity (Hanna and Park, 2020; Luccioni and Viviano, 2021; Kreutzer et al., 2022).”
Section: Harms and Risks in NLP Data
confidence: 99%
“…However, work in seemingly unrelated NLP domains (e.g. NLG, part-of-speech tagging, or semantic search) may still encounter spurious harms in datasets, especially if these are large-scale and scraped from internet sources (Luccioni and Viviano, 2021; Dodge et al., 2021; Kreutzer et al., 2022).”
Section: Introduction
confidence: 99%