2022
DOI: 10.48550/arxiv.2202.03555
Preprint

data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language

Abstract: While the general idea of self-supervised learning is identical across modalities, the actual algorithms and objectives differ widely because they were developed with a single modality in mind. To get us closer to general self-supervised learning, we present data2vec, a framework that uses the same learning method for either speech, NLP or computer vision. The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup using a standard Transformer architecture…
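The masked-prediction setup the abstract describes can be sketched concretely. The following is a minimal, hypothetical illustration, not the authors' code: the tiny encoder, the 50% mask ratio, the EMA decay value, and the smooth-L1 loss are assumed stand-ins, and the actual data2vec regresses an average of the teacher's top-K layer outputs rather than a single final layer.

    # Hypothetical sketch of masked self-distillation in the data2vec style.
    # A student encoder sees a masked view of the input and regresses the
    # latent representations an EMA teacher produces from the full input.
    import copy
    import torch
    import torch.nn as nn

    class TinyEncoder(nn.Module):
        """Stand-in for the standard Transformer used in the paper."""
        def __init__(self, dim=64, depth=2):
            super().__init__()
            layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                               batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        def forward(self, x):
            return self.encoder(x)

    student = TinyEncoder()
    teacher = copy.deepcopy(student)      # teacher weights track the student
    for p in teacher.parameters():
        p.requires_grad_(False)

    mask_token = nn.Parameter(torch.zeros(64))
    opt = torch.optim.Adam(list(student.parameters()) + [mask_token], lr=1e-4)
    tau = 0.999                           # EMA decay (illustrative value)

    x = torch.randn(8, 16, 64)            # (batch, timesteps, features)
    mask = torch.rand(8, 16) < 0.5        # random mask over timesteps

    with torch.no_grad():
        target = teacher(x)               # targets come from the FULL input

    masked = torch.where(mask.unsqueeze(-1), mask_token.expand_as(x), x)
    pred = student(masked)

    # Regress teacher latents only at masked positions (smooth L1 is one
    # plausible choice of regression loss, not necessarily the paper's).
    loss = nn.functional.smooth_l1_loss(pred[mask], target[mask])
    loss.backward()
    opt.step()

    # EMA update of the teacher from the student.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(tau).add_(ps, alpha=1 - tau)

Because the teacher consumes the unmasked input, its latents are contextualized targets rather than local ones, which is what lets the same recipe apply to speech, text, and images.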

Cited by 78 publications (129 citation statements): 2 supporting, 127 mentioning, 0 contrasting
References 21 publications
“…The learning dynamics of Odin also warrant further investigation, as does the objective used for representation learning. Recent work has revived interest in masked-autoencoding [7,23,36] and masked-distillation [6] as viable alternatives to contrastive learning. Odin, by proposing to leverage the learned representations in the design of iteratively more refined self-supervised tasks, is well positioned to benefit them as well.…”
Section: Discussion (citation type: mentioning)
confidence: 99%
“…Finally, the recently proposed masked auto-encoder (MAE) [29,56,22,20,2] is a new SSL family. It builds on a reconstruction task that randomly masks image patches and then reconstructs the missing pixels or semantic features via an auto-encoder.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
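As a concrete illustration of the masking recipe this excerpt summarizes, here is a small hypothetical sketch; the patch count, embedding size, and 75% mask ratio are assumptions for illustration, not values taken from any cited paper.

    # Hedged sketch of MAE-style random patch masking (illustrative only).
    import torch

    def random_patch_mask(num_patches: int, mask_ratio: float = 0.75):
        """Return a boolean mask: True = patch is hidden from the encoder."""
        noise = torch.rand(num_patches)
        num_masked = int(num_patches * mask_ratio)
        masked_idx = noise.argsort()[:num_masked]  # random subset of patches
        mask = torch.zeros(num_patches, dtype=torch.bool)
        mask[masked_idx] = True
        return mask

    patches = torch.randn(196, 768)       # e.g. 14x14 ViT patch embeddings
    mask = random_patch_mask(patches.shape[0])
    visible = patches[~mask]              # the encoder sees ~25% of patches
    # A decoder would then reconstruct pixels (or semantic features) at the
    # masked positions, with the loss computed on masked patches only.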
“…These results testify to the high quality, generality and transferability of the features learnt by Mugs. Note that in this work, we evaluate the effectiveness of Mugs through the vision transformer (ViT) [23,39], as ViT often achieves better performance than a CNN of the same model size [49,39] and also shows great potential to unify vision and language models [28,2].…”
Section: Introduction (citation type: mentioning)
confidence: 99%
“…Specifically, we mask spans of latent speech representations in the student model and make the student model predict the masked parts as the output of the teacher model. Inspired by [13], we introduce contextualized representations as the training target, i.e., the average of the top-k normalized latent representations, where we set k = 8 as in [13]. Unlike the self-distillation in [13], we leverage a pre-trained speech model as the teacher.…”
Section: Pre-training Distillation (citation type: mentioning)
confidence: 99%
“…Inspired by [13], we introduce contextualized representations as the training target, i.e., the average of the top-k normalized latent representations, where we set k = 8 as in [13]. Unlike the self-distillation in [13], we leverage a pre-trained speech model as the teacher. Formally, given a downsampled audio sequence x, the student is trained to minimize the L1 distance within the masked time steps M, as…”
Section: Pre-training Distillation (citation type: mentioning)
confidence: 99%
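The excerpt is cut off just before the objective itself. A plausible reconstruction of the L1 distillation loss it describes, written with assumed notation (f_theta(x)_t for the student output at time step t, z_t^(l) for the teacher's layer-l latent, L teacher layers, and norm(.) for the normalization the excerpt mentions; this is a sketch, not the citing paper's exact formula):

    \hat{z}_t = \frac{1}{8} \sum_{l=L-7}^{L} \mathrm{norm}\big(z_t^{(l)}\big),
    \qquad
    \mathcal{L} = \sum_{t \in M} \big\lVert f_\theta(x)_t - \hat{z}_t \big\rVert_1

Averaging the normalized top-8 teacher layers follows the data2vec recipe the excerpt cites as [13]; the loss is summed only over the masked time steps M.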