2022
DOI: 10.48550/arxiv.2205.14204
Preprint

Multimodal Masked Autoencoders Learn Transferable Representations

Abstract: Building scalable models to learn from diverse, multimodal data remains an open challenge. For vision-language data, the dominant approaches are based on contrastive learning objectives that train a separate encoder for each modality. While effective, contrastive learning approaches introduce sampling bias depending on the data augmentations used, which can degrade performance on downstream tasks. Moreover, these methods are limited to paired image-text data, and cannot leverage widely-available unpaired data.…
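As a point of reference for the contrastive approach the abstract describes, the sketch below shows a separate encoder per modality trained with a symmetric InfoNCE-style objective over paired image-text batches. It is a minimal illustration assuming PyTorch, placeholder encoders, and arbitrary dimensions; it is not the paper's method, only the setup the abstract contrasts against.

```python
# Minimal sketch of a contrastive vision-language setup: one encoder per
# modality, trained only on *paired* (image, text) batches. Encoder modules,
# dimensions, and the temperature value are illustrative assumptions.
import torch
import torch.nn.functional as F

def clip_style_loss(image_encoder, text_encoder, images, texts, temperature=0.07):
    img_emb = F.normalize(image_encoder(images), dim=-1)   # (B, D)
    txt_emb = F.normalize(text_encoder(texts), dim=-1)     # (B, D)

    logits = img_emb @ txt_emb.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: match each image to its paired text and vice versa.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```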

Cited by 11 publications (20 citation statements)
References 20 publications
“…It would also be interesting to study introducing auxiliary prediction for other modalities, such as audio. Another weakness is that our model operates only on RGB pixels from a single camera viewpoint; we look forward to a future work that incorporates different input modalities such as proprioceptive states and point clouds, building on top of the recent multi-modal learning approaches [52,53]. Finally, our approach trains behaviors from scratch, which makes it still too sample-inefficient to be used in real-world scenarios.…”
Section: Discussion (mentioning)
confidence: 99%
“…The pioneering works of CLIP [57] and ALIGN [31] make use of contrastive learning to pretrain models on billion-scale web-crawled image-text pairs. There are an increasing number of studies to improve their generality from various modeling perspectives, including training objectives [17,19,20,52,82,85], scaling techniques [15,54,82], data efficiency [38,41], and leveraging multilingual correlations [15,30]. In academia, several works demonstrate techniques to improve the learned semantic representations on datasets at a smaller scale (e.g.…”
Section: Related Work (mentioning)
confidence: 99%
“…Masking across modalities: BART [51] applies a similar strategy to text data, but uses a cross-entropy loss over the masked-token output distribution and a far lower masking ratio (typically around 15%). M3AE [31] extends MAE and BART to incorporate inputs from both text and image modalities, using per-modality input and output projections. Otherwise the encoder-decoder architecture resembles MAE, with the addition of learned modality-indicating embeddings to each transformer input.…”
Section: Masked Autoencoders (mentioning)
confidence: 99%
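To make the excerpt's description concrete, the sketch below illustrates the kind of input construction it attributes to M3AE: per-modality input projections, a learned modality-indicating embedding added to each token, and random masking before a shared encoder. This is a minimal sketch under assumed dimensions and masking ratio, not the released M3AE implementation; positional embeddings and the MAE-style decoder are omitted for brevity.

```python
# Minimal sketch of multimodal masked-autoencoder input construction:
# per-modality projections, a learned modality-indicating embedding added to
# every token, and random masking before a shared encoder. Dimensions, vocab
# size, the 0.75 masking ratio, and module names are illustrative assumptions.
import torch
import torch.nn as nn

class MultimodalMaskedInput(nn.Module):
    def __init__(self, patch_dim=768, vocab_size=30522, embed_dim=512):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, embed_dim)      # image patches -> tokens
        self.text_embed = nn.Embedding(vocab_size, embed_dim)  # text ids -> tokens
        self.modality_embed = nn.Embedding(2, embed_dim)       # 0 = image, 1 = text

    def forward(self, patches, text_ids, mask_ratio=0.75):
        # Project each modality into the shared embedding space.
        img_tok = self.patch_proj(patches)                     # (B, N_img, D)
        txt_tok = self.text_embed(text_ids)                    # (B, N_txt, D)

        # Add learned modality-indicating embeddings to each transformer input.
        img_tok = img_tok + self.modality_embed.weight[0]
        txt_tok = txt_tok + self.modality_embed.weight[1]

        tokens = torch.cat([img_tok, txt_tok], dim=1)          # (B, N, D)

        # Randomly keep a subset of tokens; only the visible ones go to the
        # encoder, and a decoder would reconstruct the masked remainder (as in MAE).
        B, N, D = tokens.shape
        n_keep = max(1, int(N * (1 - mask_ratio)))
        keep = torch.rand(B, N, device=tokens.device).argsort(dim=1)[:, :n_keep]
        visible = torch.gather(tokens, 1, keep.unsqueeze(-1).expand(-1, -1, D))
        return visible, keep
```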