2022
DOI: 10.48550/arxiv.2204.08227
Preprint

The Devil is in the Frequency: Geminated Gestalt Autoencoder for Self-Supervised Visual Pre-Training

Abstract: The self-supervised Masked Image Modeling (MIM) schema, which follows a "mask-and-reconstruct" pipeline of recovering content from a masked image, has recently attracted increasing interest in the community owing to its excellent ability to learn visual representations from unlabeled data. Aiming to learn representations with high-level semantic abstraction, one group of works attempts to reconstruct non-semantic pixels with a large-ratio masking strategy, which may suffer from an "over-smoothing" problem, while others d…
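To make the "mask-and-reconstruct" pipeline concrete, below is a minimal, hedged sketch of a MIM training step in the spirit of MAE-style methods. It is not the paper's Geminated Gestalt Autoencoder; every class, dimension, and hyperparameter (ToyMaskedAutoencoder, patch_dim, mask_ratio, etc.) is a hypothetical illustration.

```python
# Minimal sketch of the "mask-and-reconstruct" MIM pipeline described in the
# abstract, NOT the paper's actual method. All names here are hypothetical.
import torch
import torch.nn as nn

class ToyMaskedAutoencoder(nn.Module):
    def __init__(self, patch_dim=48, embed_dim=64, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.encoder = nn.Sequential(nn.Linear(patch_dim, embed_dim), nn.GELU(),
                                     nn.Linear(embed_dim, embed_dim))
        self.decoder = nn.Linear(embed_dim, patch_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))

    def forward(self, patches):
        # patches: (B, N, patch_dim) flattened image patches
        B, N, D = patches.shape
        num_keep = int(N * (1 - self.mask_ratio))
        # Random per-sample shuffle; the first num_keep patches stay visible.
        noise = torch.rand(B, N, device=patches.device)
        ids_keep = noise.argsort(dim=1)[:, :num_keep]
        visible = torch.gather(patches, 1,
                               ids_keep.unsqueeze(-1).expand(-1, -1, D))
        # Encode only the visible patches, then scatter them back among
        # learned mask tokens at the masked positions.
        enc = self.encoder(visible)
        full = self.mask_token.expand(B, N, -1).clone()
        full.scatter_(1, ids_keep.unsqueeze(-1).expand(-1, -1, enc.size(-1)), enc)
        pred = self.decoder(full)
        # Reconstruction loss computed on masked positions only.
        mask = torch.ones(B, N, device=patches.device)
        mask.scatter_(1, ids_keep, 0.0)
        return (((pred - patches) ** 2).mean(-1) * mask).sum() / mask.sum()

model = ToyMaskedAutoencoder()
loss = model(torch.randn(2, 64, 48))  # batch of 2 images, 64 patches each
loss.backward()
```

The design choice that distinguishes the methods quoted in the citation statements below is what `pred` is trained to match: raw pixels, discrete tokens, HOG or deep features, or frequency spectra.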

Cited by 3 publications (4 citation statements)
References 43 publications (87 reference statements)
“…For example, as a representative of such methods, the reconstruction quality of MAE [27] is poor: fine details and textures are missing (Figure 2). A similar issue exists in many other MIM methods [11,36].…”
Section: Introduction (mentioning)
confidence: 68%
“…With the development of vision transformers [18,23,39,50,55], Masked Image Modeling (MIM) gradually replaces the dominant position of contrastive learning [10,25,54] in visual self-supervised representation learning due to its superior fine-tuning performance in various visual downstream tasks. Many target signals have been designed for the mask-prediction pretext task in MIM, such as normalized pixels [24,60], discrete tokens [2,17], HOG feature [57], deep features [1,67] or frequencies [38,59]. However, they are all only applied as single-scale supervisions for reconstruction.…”
Section: Related Work (mentioning)
confidence: 99%
“…MIM learns semantic representations by first masking some parts of the input and then predicting their signals based on the unmasked parts, e.g., normalized pixels [24,60], discrete tokens [2,17], HOG feature [57], deep features [1,67] or frequencies [38,59].…”
Section: Introduction (mentioning)
confidence: 99%
“…Besides, Fang et al [22] employ an auxiliary generator to corrupt the input images. As to what to predict, beyond default raw pixels [26,79], several other reconstruction targets are proposed, e.g., hand-crafted or deep features [65], low or high frequencies [38,74], and discrete tokens [2]. Correlational modeling is the crucial process in visual tracking [81], aiming to predict a dense set of matching confidence for a target object.…”
Section: Related Work (mentioning)
confidence: 99%
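Several of the statements above cite frequency-domain reconstruction targets ([38,59]; "low or high frequencies [38,74]"). As a rough, hedged sketch of what such a target can look like, the helper below maps patches to log-magnitude Fourier spectra and scores predictions only on masked positions; the function names and the exact loss form are illustrative assumptions, not the cited papers' formulations.

```python
# Hedged sketch of a frequency-domain reconstruction target, in the spirit of
# the frequency-based MIM objectives cited above. All names are hypothetical.
import torch

def frequency_target(patches: torch.Tensor) -> torch.Tensor:
    """Map flattened patches (B, N, D) to log-magnitude Fourier spectra.

    Predicting spectra rather than raw pixels makes global structure (low
    frequencies) and fine textures (high frequencies) explicit in the target.
    """
    spectrum = torch.fft.fft(patches, dim=-1)  # complex FFT over the patch dim
    return torch.log1p(spectrum.abs())         # log-magnitude as a stabilizer

def frequency_loss(pred: torch.Tensor, patches: torch.Tensor,
                   mask: torch.Tensor) -> torch.Tensor:
    """L2 loss between predicted and true spectra, on masked patches only."""
    per_patch = ((pred - frequency_target(patches)) ** 2).mean(-1)  # (B, N)
    return (per_patch * mask).sum() / mask.sum()

# Usage: a decoder would produce `pred` with the same shape as the spectra.
B, N, D = 2, 64, 48
patches = torch.randn(B, N, D)
mask = (torch.rand(B, N) < 0.75).float()  # 1 = masked position
loss = frequency_loss(torch.randn(B, N, D), patches, mask)
```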