2021
DOI: 10.1016/j.patter.2021.100336

Data and its (dis)contents: A survey of dataset development and use in machine learning research

Cited by 325 publications (181 citation statements)
References: 79 publications

“…In this final section, we also echo other critical work in machine learning (Paullada et al., 2021; Hutchinson et al., 2021) and argue that understanding (speech) datasets as increasingly important infrastructure is useful. It allows us to reframe the task of speech technology development from one primarily done by corporations for markets to one done by a wider range of actors for speech communities.…”
Section: Towards Better Practices (supporting)
confidence: 71%
“…Machine learning, computer vision, and social media studies often use "found" data [Hemphill et al. 2021; Jo and Gebru 2020; Paullada et al. 2021] and render curatorial decisions such as "what data should be available," "in which format(s) should data be provided," or "how should this data be sampled" invisible. For instance, datasets scraped from the web (such as Flickr photos [Scheuerman et al. 2021; Zhang et al. 2015] or Wikipedia talk pages [Wulczyn et al. 2016, 2017]) suffer from biases in representation [Jo and Gebru 2020].…”
Section: What Renders Data Curation Invisible? (mentioning)
confidence: 99%
“…Finally, a common opinion in machine learning [96] has been that, given enough data and capacity, machine learning bias generally has a vanishing influence over the resulting bias in the learned solution. On the contrary, scale can obfuscate [82] misspecifications in the task and/or data collection design [97, 98]. Here, we focused on how misspecifications in the algorithm design for anomaly detection can result in gross failure even in the ideal theoretical settings of infinite data and capacity.…”
Section: Broader Impact (mentioning)
confidence: 99%
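
The claim in the last excerpt, that a misspecified detector design fails no matter how much data it sees, can be illustrated with a minimal toy sketch. The example below is hypothetical and is not drawn from the cited paper: it scores points by their distance from the training mean, so anomalies that sit inside the bulk of the normal data are never flagged, and enlarging the training set does not change that.

import numpy as np

# Hypothetical toy sketch (not from the cited work): an anomaly score defined
# as distance from the training mean. The design assumes anomalies lie far
# from the bulk of normal data; when that assumption is wrong, more training
# data cannot repair the detector.
rng = np.random.default_rng(0)

def fit_mean_distance_detector(train):
    mu = train.mean()
    return lambda x: np.abs(x - mu)  # anomaly score: distance from the mean

for n_train in (1_000, 1_000_000):  # scale the training data up 1000x
    normal_train = rng.normal(0.0, 1.0, n_train)
    score = fit_mean_distance_detector(normal_train)

    normal_test = rng.normal(0.0, 1.0, 10_000)
    anomalies = rng.normal(0.0, 0.05, 10_000)  # anomalies hidden inside the bulk

    # Flag the top 1% of scores, with the threshold set on held-out normal data.
    threshold = np.quantile(score(normal_test), 0.99)
    recall = (score(anomalies) > threshold).mean()
    print(f"n_train={n_train:>9,}  detected anomalies: {recall:.1%}")

More data only sharpens the estimate of the mean; it cannot compensate for the score function's built-in assumption about where anomalies live, which is the sense in which scale can obscure a design misspecification rather than fix it.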