2020
DOI: 10.1007/s10579-020-09489-2

Mapping languages: the Corpus of Global Language Use

Abstract: This paper describes a web-based corpus of global language use with a focus on how this corpus can be used for data-driven language mapping. First, the corpus provides a representation of where national varieties of major languages are used (e.g., English, Arabic, Russian) together with consistently collected data for each variety. Second, the paper evaluates a language identification model that supports more local languages with smaller sample sizes than alternative off-the-shelf models. Improved language identification […]

Cited by 21 publications (23 citation statements). References 17 publications.
“…While we expected some accuracy loss due to the domain mismatch between clean training data and noisy web text (Dunn, 2020), even after document-consistency filtering the LangID labels were so noisy that the corpora for the majority of languages in our crawl were unusable for any practical NLP task. Table 2 presents some representative samples of noise.…”
Section: Failure Modes of LangID Models on Web Text (confidence: 99%)
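The document-consistency filtering mentioned in this citation statement can be illustrated with a minimal sketch: label each line of a crawled document separately and keep a document-level label only when most lines agree on it. The function name `consistent_label`, the `predict_language` stand-in for whatever LangID model is used, and the 0.7 agreement threshold are all illustrative assumptions, not details taken from the cited work.

```python
from collections import Counter
from typing import Callable, Optional

def consistent_label(
    document: str,
    predict_language: Callable[[str], str],  # stand-in for any line-level LangID model
    min_agreement: float = 0.7,              # illustrative threshold, not from the cited paper
) -> Optional[str]:
    """Return a language label only when most lines of the document agree on it."""
    lines = [line.strip() for line in document.splitlines() if line.strip()]
    if not lines:
        return None
    votes = Counter(predict_language(line) for line in lines)
    label, count = votes.most_common(1)[0]
    return label if count / len(lines) >= min_agreement else None
```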
“…Naturally, LangID systems have been applied to web crawls before: Buck et al. (2014) published n-gram language models for 175 languages based on Common Crawl data. The Corpora Collection at Leipzig University (Goldhahn et al., 2012) and the Corpus of Global Language Use (Dunn, 2020) offer corpora in 252 and 148 languages. The largest language coverage is probably An Crúbadán, which does not leverage LangID, and found (small amounts of) web data in about 2,000 languages (Scannell, 2007).…”
Section: Previous Implementations (confidence: 99%)
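As a concrete example of running an off-the-shelf LangID model over web text, as discussed in this citation statement, the sketch below uses fastText's pretrained lid.176.bin model, one of several alternatives to a corpus-specific model. The model path, the `identify` function name, and the 0.5 confidence cutoff are assumptions for illustration only.

```python
import fasttext  # pip install fasttext; lid.176.bin is downloaded separately from the fastText site

# Path is an assumption: place the pretrained model file wherever is convenient.
model = fasttext.load_model("lid.176.bin")

def identify(text: str, min_confidence: float = 0.5):
    """Return (iso_code, probability), or None if the prediction is low-confidence."""
    # predict() rejects newlines, so collapse them before labelling.
    labels, probs = model.predict(text.replace("\n", " "), k=1)
    code = labels[0].replace("__label__", "")
    return (code, float(probs[0])) if probs[0] >= min_confidence else None

print(identify("Ko te reo Māori te reo taketake o Aotearoa."))
```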
“…A similar framework of LanguageCrawl [69] filtered out language-specific webpages from CCC using CLD2 to develop Word2Vec language models of various languages. In addition, Parvaz and Megerdoomian [51] developed multilingual corpora of 148 languages and 423 billion tokens from CCC. Despite the need for high compute and storage, few efforts have been made to crawl the World-Wide-Web (WWW) to develop text corpora of low-resource languages.…”
Section: Related Work (confidence: 99%)
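The CLD2-based filtering of Common Crawl pages described in this citation statement can be sketched with the pycld2 bindings: detect the dominant language of a page and keep it only when the prediction is reliable and matches the target. The `keep_page` function name and the default target code are illustrative assumptions, not details of the LanguageCrawl pipeline.

```python
import pycld2  # pip install pycld2

def keep_page(page_text: str, target_iso: str = "en") -> bool:
    """Keep a crawled page only when CLD2 reliably detects the target language."""
    try:
        is_reliable, _bytes_found, details = pycld2.detect(page_text)
    except pycld2.error:
        return False  # undecodable or otherwise unusable input
    # Each entry in `details` is (language_name, iso_code, percent, score),
    # ordered by share of the text; only the top entry is checked here.
    return bool(is_reliable) and details[0][1] == target_iso
```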
“…This dataset is summarized in Table 1. The corpus contains the same amount of data per register per language (Dunn, 2020; Dunn and Adams, 2020).…”
Section: Experimental Design (confidence: 99%)
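One way to read the "same amount of data per register per language" design noted in this citation statement is as a balanced downsampling step. The sketch below is a hedged interpretation using pandas: the `balance` function, the column names, and the per-cell document budget are assumptions for illustration, not the procedure used to build the corpus.

```python
import pandas as pd

def balance(corpus: pd.DataFrame, docs_per_cell: int, seed: int = 42) -> pd.DataFrame:
    """Downsample so every (language, register) cell holds the same number of documents.

    `corpus` is assumed to carry 'language', 'register', and 'text' columns;
    cells with fewer than `docs_per_cell` documents are dropped entirely.
    """
    cells = corpus.groupby(["language", "register"])
    kept = [
        cell.sample(n=docs_per_cell, random_state=seed)
        for _, cell in cells
        if len(cell) >= docs_per_cell
    ]
    return pd.concat(kept, ignore_index=True)
```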