2022
DOI: 10.48550/arxiv.2201.10474
Preprint
Whose Language Counts as High Quality? Measuring Language Ideologies in Text Data Selection

Abstract: Language models increasingly rely on massive web dumps for diverse text data. However, these sources are rife with undesirable content. As such, resources like Wikipedia, books, and newswire often serve as anchors for automatically selecting web text most suitable for language modeling, a process typically referred to as quality filtering. Using a new dataset of U.S. high school newspaper articles, written by students from across the country, we investigate whose language is preferred by the quality filter used …

Cited by 10 publications (12 citation statements)
References 31 publications
“…Relatedly, these studies do not explore demographic differences among participants as they relate to effects on perceived agency. However, as noted previously, large language models tend to privilege the language of over-represented groups such as affluent white men [48]. Future research could make necessary contributions to the study of human-AI interaction by investigating similarities and differences between the current findings and research on specific subpopulations.…”
Section: Limitations (mentioning, confidence: 71%)
“…Additionally, recent research suggests that large language models such as GPT-3 privilege and reproduce the language of people from "wealthier, educated, and urban ZIP codes" [48]. Taken in combination with participants' varying levels of technological literacy in the current study, it is unlikely that the societal biases reproduced in text-based AI systems are immediately salient to everyone using them.…”
Section: Discussion (mentioning, confidence: 82%)
“…The dominance of English, and to a lesser degree Chinese, reifies cultural hegemonies and precipitates technological imperialism. Even when researchers seek to include other languages, these purportedly multilingual models often underserve certain languages and communities (Kerrison et al., 2018; Virtanen et al., 2019; Kreutzer et al., 2022; Gururangan et al., 2022). We also note that few of these models have been assessed for bias or fairness (see table 1).…”
Section: Language Is Multicultural, Language Models Are Not (mentioning, confidence: 98%)
“…Data Voids: Social disparities in literacy and internet access might cause entire communities to be excluded from language data (Sambasivan et al., 2021). Further, the risk of unintentionally excluding marginalized communities based on dialect or other linguistic features while filtering data to ensure quality (Dodge et al., 2021; Gururangan et al., 2022) is even higher in the Indian context because of very limited computational representation of marginalized communities. Accounting for data voids and intentional data curation (e.g., collecting language data specifically from marginalized communities (Abraham et al., 2020; Nekoto et al., 2020)) can significantly help bridge this gap.…”
Section: Accounting For Indian Societal Context (mentioning, confidence: 99%)