2021

DOI: 10.48550/arxiv.2110.01963

Preprint

Multimodal datasets: misogyny, pornography, and malignant stereotypes

Abeba Birhane, Vinay Uday Prabhu, Emmanuel Kahembwe

Abstract: We have now entered the era of trillion parameter machine learning models trained on billion-sized datasets scraped from the internet. The rise of these gargantuan datasets has given rise to formidable bodies of critical work that has called for caution while generating these large datasets. These address concerns surrounding the dubious curation practices used to generate these datasets, the sordid quality of alt-text data available on the world wide web, the problematic content of the CommonCrawl dataset oft…

Cited by 64 publications (69 citation statements)

References 43 publications

Supporting: 1

Mentioning: 68

Contrasting: 0

“…Following Aghajanyan et al (2021) we aim to implement a transform over HTML documents to extract out to minimal-HTML, i.e., the minimal set of text that is semantically relevant for end tasks. Birhane et al (2021) gave in-depth criticisms of Common Crawl based multi-modal datasets and showed the existence of highly problematic examples (i.e., explicit images and text pairs of rape, pornography, and ethnic slurs). Given these severe ethical concerns, we opt-out of processing all of Common Crawl and instead opt into using a subset of the Common Crawl News (CC-NEWS) dataset and all of English Wikipedia.…”

Section: Data (mentioning)

confidence: 99%

CM3: A Causal Masked Multimodal Model of the Internet

et al. 2022

Preprint

We introduce CM3, a family of causally masked generative models trained over a large corpus of structured multi-modal documents that can contain both text and image tokens. Our new causally masked approach generates tokens left to right while also masking out a small number of long token spans that are generated at the end of the string, instead of their original positions. The causal masking objective provides a type of hybrid of the more common causal and masked language models, by enabling full generative modeling while also providing bidirectional context when generating the masked spans. We train causally masked language-image models on large-scale web and Wikipedia articles, where each document contains all of the text, hypertext markup, hyperlinks, and image tokens (from a VQVAE-GAN), provided in the order they appear in the original HTML source (before masking). The resulting CM3 models can generate rich structured, multimodal outputs while conditioning on arbitrary masked document contexts, and thereby implicitly learn a wide range of text, image, and cross-modal tasks. They can be prompted to recover, in a zero-shot fashion, the functionality of models such as DALL-E, GENRE, and HTLM (De Cao et al., 2020; Aghajanyan et al., 2021). We set the new state-of-the-art in zero-shot summarization, entity linking, and entity disambiguation while maintaining competitive performance in the fine-tuning setting. We can generate images unconditionally, conditioned on text (like DALL-E) and do captioning all in a zero-shot setting with a single model.
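The span-moving step described in the abstract above can be illustrated with a minimal sketch. It is not the CM3 authors' implementation: the function name `causal_mask`, the `<mask:i>` sentinel format, and the word-level tokens are assumptions chosen for illustration. Each chosen span is replaced by a sentinel at its original position and then appended, after the same sentinel, at the end of the sequence.

```python
from typing import List, Sequence, Tuple


def causal_mask(tokens: Sequence[str], spans: Sequence[Tuple[int, int]]) -> List[str]:
    """Move each [start, end) span to the end of the sequence.

    A sentinel token marks where the span used to be, and the same
    sentinel precedes the span's tokens in the appended tail.
    The sentinel format "<mask:i>" is a hypothetical choice.
    """
    body: List[str] = []   # sequence with spans replaced by sentinels
    tail: List[str] = []   # sentinels followed by the removed spans
    cursor = 0
    for i, (start, end) in enumerate(sorted(spans)):
        if start < cursor or end <= start or end > len(tokens):
            raise ValueError("spans must be non-empty, in range, and non-overlapping")
        sentinel = f"<mask:{i}>"
        body.extend(tokens[cursor:start])  # keep text before the span
        body.append(sentinel)              # leave a placeholder in its place
        tail.append(sentinel)              # repeat the placeholder at the end
        tail.extend(tokens[start:end])     # followed by the span itself
        cursor = end
    body.extend(tokens[cursor:])           # keep text after the last span
    return body + tail


if __name__ == "__main__":
    words = "the cat sat on the mat".split()
    print(" ".join(causal_mask(words, [(2, 4)])))
```

A left-to-right model trained on the rearranged sequence sees the text on both sides of a masked span before it has to generate that span, which is the bidirectional context the abstract refers to.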

“…[92,16,91]) which directly leads to toxic biases (e.g. [41,32,11]); we trained our model on YouTube, which is a moderated platform [101]. Though the content moderation might perhaps reduce overtly 'toxic' content, social media platforms like YouTube still contain harmful microaggressions [15], and alt-lite to alt-right content [94].…”

Section: A2 Biases In (Pre)training Data (mentioning)

confidence: 99%

MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound

Lü, Lu, et al. 2022

Preprint

Add a third of a cup of popcorn. Now turn the heat on high. Add a lid, and then *sizzling* *pouring sound* *lid clinking* jiggle it while it pops *jiggling, popcorn popping*

“…1 The images were subjected to an array of automated filters designed to remove potentially offensive content. While certainly not perfect, this substantially reduces the issues that plague other large image datasets [8,55]. We construct a multi-label dataset using these images by converting all hashtags into their corresponding canonical targets (note that a single image may have multiple hashtags).…”

Section: Hashtag Dataset Collection (mentioning)

confidence: 99%

Revisiting Weakly Supervised Pre-Training of Visual Perception Models

et al. 2022

Preprint

Model pre-training is a cornerstone of modern visual recognition systems. Although fully supervised pre-training on datasets like ImageNet is still the de-facto standard, recent studies suggest that large-scale weakly supervised pretraining can outperform fully supervised approaches. This paper revisits weakly-supervised pre-training of models using hashtag supervision with modern versions of residual networks and the largest-ever dataset of images and corresponding hashtags. We study the performance of the resulting models in various transfer-learning settings including zero-shot transfer. We also compare our models with those obtained via large-scale self-supervised learning. We find our weakly-supervised models to be very competitive across all settings, and find they substantially outperform their self-supervised counterparts. We also include an investigation into whether our models learned potentially troubling associations or stereotypes. Overall, our results provide a compelling argument for the use of weakly supervised learning in the development of visual recognition systems. Our models, Supervised Weakly through hashtAGs (SWAG), are available publicly.
