2021
DOI: 10.48550/arxiv.2110.01963
Preprint
Multimodal datasets: misogyny, pornography, and malignant stereotypes

Abeba Birhane,
Vinay Uday Prabhu,
Emmanuel Kahembwe

Abstract: We have now entered the era of trillion parameter machine learning models trained on billion-sized datasets scraped from the internet. The rise of these gargantuan datasets has given rise to formidable bodies of critical work that has called for caution while generating these large datasets. These address concerns surrounding the dubious curation practices used to generate these datasets, the sordid quality of alt-text data available on the world wide web, the problematic content of the CommonCrawl dataset oft…

Cited by 64 publications (69 citation statements)

References 43 publications
“…Following Aghajanyan et al (2021) we aim to implement a transform over HTML documents to extract out to minimal-HTML, i.e., the minimal set of text that is semantically relevant for end tasks. Birhane et al (2021) gave in-depth criticisms of Common Crawl based multi-modal datasets and showed the existence of highly problematic examples (i.e., explicit images and text pairs of rape, pornography, and ethnic slurs). Given these severe ethical concerns, we opt-out of processing all of Common Crawl and instead opt into using a subset of the Common Crawl News (CC-NEWS) dataset and all of English Wikipedia.…”
Section: Data
Mentioning confidence: 99%
“…[92,16,91]) which directly leads to toxic biases (e.g. [41,32,11]); we trained our model on YouTube, which is a moderated platform [101]. Though the content moderation might perhaps reduce overtly 'toxic' content, social media platforms like YouTube still contain harmful microaggressions [15], and alt-lite to alt-right content [94].…”
Section: A2 Biases In (Pre)training Data
Mentioning confidence: 99%
“…1 The images were subjected to an array of automated filters designed to remove potentially offensive content. While certainly not perfect, this substantially reduces the issues that plague other large image datasets [8,55]. We construct a multi-label dataset using these images by converting all hashtags into their corresponding canonical targets (note that a single image may have multiple hashtags).…”
Section: Hashtag Dataset Collection
Mentioning confidence: 99%