2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01553
Masked Autoencoders Are Scalable Vision Learners

Cited by 3,566 publications (1,839 citation statements)
References 25 publications
“…Compared with plain ViT, AugReg-ViT, and ViT with a discrete representation called DrViT [41], DAT with AugReg achieves better performance. The best result is from ViT-Huge pretrained by MAE [19] and fine-tuned by DAT, which suggests DAT is also effective in downstream fine-tuning tasks.…”
Section: Results
confidence: 94%
“…For ViTs, we adopt ViT-B/16 as the baseline model, trained with the recipes in AugReg [48]. In addition, we use DAT to conduct supervised fine-tuning on the downstream ImageNet classification task, starting from a self-supervised ViT-Huge pretrained by MAE [19]. By default, ViT refers to ViT-B/16 in all tables and figures.…”
Section: Image Classification
confidence: 99%
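The excerpt above describes a supervised fine-tuning setup on top of a pretrained ViT backbone. The following is a minimal sketch of such a setup, assuming a timm ViT-B/16 model and illustrative hyperparameters; it is not the cited authors' code, and in their setup the backbone would carry AugReg- or MAE-pretrained weights rather than the default timm checkpoint.

```python
import torch
import timm

# Illustrative assumption: a ViT-B/16 backbone with a 1000-class ImageNet head.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=1000)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.05)
criterion = torch.nn.CrossEntropyLoss()

def finetune_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One supervised fine-tuning step on a batch of labelled images."""
    logits = model(images)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```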
“…In image inpainting, Deng et al. proposed the Contextual Transformer Network to improve the continuity of context. [16] proposed an encoder-decoder-based transformer named the masked autoencoder. Moreover, [9] develops an inpainting transformer for completing large missing regions, which is the state-of-the-art method to date.…”
Section: Image Completion
confidence: 99%
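The masked autoencoder mentioned in the excerpt above feeds only a random subset of image patches to the encoder and asks a decoder to reconstruct the rest. A minimal sketch of such a random-masking step is shown below; the (batch, num_patches, embed_dim) layout and the 75% ratio follow the usual MAE convention, but this is an illustrative simplification, not the paper's implementation.

```python
import torch

def random_masking(patches: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of patch tokens; return visible tokens and a mask."""
    B, N, D = patches.shape
    num_keep = int(N * (1 - mask_ratio))

    # A random permutation per sample decides which patches stay visible.
    noise = torch.rand(B, N, device=patches.device)
    ids_shuffle = noise.argsort(dim=1)
    ids_keep = ids_shuffle[:, :num_keep]

    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))

    # Binary mask: 1 marks patches the decoder must reconstruct.
    mask = torch.ones(B, N, device=patches.device)
    mask.scatter_(1, ids_keep, 0)
    return visible, mask, ids_shuffle
```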
“…We designed a transformer-based neural network to achieve phase completion for fringe projection profilometry. Our network is inspired by [16] and [9]; it is a simplification of [9] owing to memory-cost and forward-speed considerations. The model has three parts: a convolution-based encoder, a transformer-based module with contextual attention, and a feedforward reconstruction module.…”
Section: Completion Model
confidence: 99%
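To make the three-part layout in the excerpt above concrete, here is a rough sketch of a convolutional encoder feeding a transformer stage and a feed-forward reconstruction head. The class name, channel widths, depth, and the 4x downsampling are all assumptions for illustration; the cited network's actual contextual-attention block is not reproduced here.

```python
import torch
import torch.nn as nn

class CompletionNet(nn.Module):
    """Hypothetical conv-encoder + transformer + feed-forward reconstruction sketch."""
    def __init__(self, in_ch: int = 1, dim: int = 256, depth: int = 4, heads: int = 8):
        super().__init__()
        # 1) Convolution-based encoder: downsample the input map 4x into feature tokens.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, dim // 2, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim // 2, dim, 3, stride=2, padding=1), nn.GELU(),
        )
        # 2) Transformer module operating on the flattened feature tokens.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        # 3) Feed-forward reconstruction head: one 4x4 pixel patch per token.
        self.reconstruct = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(),
            nn.Linear(dim, 4 * 4 * in_ch),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(x)                        # (B, dim, H/4, W/4)
        B, D, H, W = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)      # (B, H*W, dim)
        tokens = self.transformer(tokens)
        patches = self.reconstruct(tokens)             # (B, H*W, 16*in_ch)
        # Rearrange per-token 4x4 patches back into a full-resolution map.
        out = patches.transpose(1, 2).reshape(B, -1, H, W)
        return nn.functional.pixel_shuffle(out, 4)     # (B, in_ch, H*4, W*4)
```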
“…However, the training setup (e.g., batch size) in the original papers is not always affordable for research institutions. More recently, masked image modeling (MIM) methods, represented by MAE [11], have been shown to learn rich visual representations and significantly improve performance on downstream tasks [18]. After randomly masking patches of the input images, a pixel-level regression target is set as the pretext task.…”
Section: Introduction
confidence: 99%
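The pixel-level regression pretext task mentioned in the excerpt above reduces to a mean-squared error computed only on the masked patches. A minimal sketch, assuming per-patch pixel targets and a binary mask where 1 marks a masked patch:

```python
import torch

def masked_pixel_loss(pred: torch.Tensor,     # (B, N, patch_pixels) predicted pixels
                      target: torch.Tensor,   # (B, N, patch_pixels) ground-truth pixels
                      mask: torch.Tensor) -> torch.Tensor:  # (B, N), 1 = masked patch
    """MSE regression loss averaged over masked patches only."""
    per_patch = ((pred - target) ** 2).mean(dim=-1)   # MSE per patch
    return (per_patch * mask).sum() / mask.sum()      # ignore visible patches
```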