2021
DOI: 10.48550/arxiv.2112.09133
Preprint
Masked Feature Prediction for Self-Supervised Visual Pre-Training

Abstract: We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions. We study five different types of features and find Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in …
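The two ingredients the abstract describes — random masking of input patches, and HOG targets with local contrast normalization — can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the cell size, bin count, 2x2 block normalization, and masking ratio below are illustrative assumptions, and a real pipeline would operate on video clips rather than a single 2-D array.

```python
import numpy as np

def hog_targets(img, cell=8, bins=9):
    """Per-cell histograms of oriented gradients with local (2x2-block) L2
    normalization -- the contrast-normalization step the paper reports as
    essential. `img` is a 2-D float array with H and W divisible by `cell`."""
    gy, gx = np.gradient(img)                     # image gradients
    mag = np.hypot(gx, gy)                        # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), np.pi)       # unsigned orientation [0, pi)
    h, w = img.shape
    ch, cw = h // cell, w // cell
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    hist = np.zeros((ch, cw, bins))
    for i in range(ch):                           # magnitude-weighted histogram per cell
        for j in range(cw):
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            b = bin_idx[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            hist[i, j] = np.bincount(b, weights=m, minlength=bins)
    # local contrast normalization: L2-normalize each 2x2 block of cells
    out = []
    for i in range(ch - 1):
        for j in range(cw - 1):
            block = hist[i:i+2, j:j+2].ravel()
            out.append(block / (np.linalg.norm(block) + 1e-6))
    return np.concatenate(out)

def random_mask(num_patches, ratio=0.4, rng=None):
    """Boolean mask over patch indices: True = masked, i.e. the model must
    predict the HOG features of these patches from the visible context."""
    rng = rng or np.random.default_rng()
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.permutation(num_patches)[:int(num_patches * ratio)]] = True
    return mask
```

In this sketch the pre-training target for a masked patch would be the corresponding slice of `hog_targets`, regressed by the model from the unmasked patches.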

Cited by 48 publications (106 citation statements)
References 54 publications
“…Therefore our method can provide abundant nontrivial image pairs feeding the enhancer. (Wei et al, 2021) proposes to predict hand-crafted image feature descriptors at the masked positions. As MIM is originated in masked language modeling (Devlin et al, 2019), CIM is inspired by (Clark et al, 2020).…”
Section: Related Work (mentioning, confidence: 99%)
“…(MIM; Bao et al, 2021), which randomly masks out some input tokens and then recovers the masked content by conditioning on the visible context, is able to learn rich visual representations and shows promising performance on various vision benchmarks (Zhou et al, 2021;He et al, 2021;Xie et al, 2021;Dong et al, 2021;Wei et al, 2021;El-Nouby et al, 2021).…”
Section: Introduction (mentioning, confidence: 99%)
“…In this work, we found that our proposed MIM is more effective than MLM. Inspired by recent works of self-supervised learning on vision [12,42], we propose to mask out image patches with larger proportion and follow MaskFeat [41] to reconstruct other views of the whole image rather than recovering those masked regions only.…”
Section: Related Work (mentioning, confidence: 99%)
“…Our supervision is provided by another view of the original intact input image. While MaskFeat [41] used Histograms of Oriented Gradients (HOG) as the supervision for visual pre-training, we rely on the more discriminative signals from ClusterFit [43] and GrokNet [4] to extract additional two views of the raw image. Between the two, [43] provides the clustering probability 𝑐 (v) while [4] extracts the pool5 embedding 𝑟 (v) (feature output from the 5-th Conv Block).…”
Section: Image-Text Pre-training (mentioning, confidence: 99%)