Masked Autoencoders Are Scalable Vision Learners

He, Kaiming; Chen, Xinlei; Xie, Saining; Li, Yanghao; Dollár, Piotr; Girshick, Ross

doi:10.48550/arxiv.2111.06377

Cited by 308 publications

(945 citation statements)

References 35 publications

(65 reference statements)

Supporting

Mentioning

919

Contrasting

“…For ResNet-200, the initial number of blocks at each stage is (3,24,36,3). We change it to Swin-B's (3, 3, 27, 3) at the step of changing stage ratio.…”

Section: Modernizing ResNets: Detailed Results (mentioning)

confidence: 99%

A ConvNet for the 2020s

Liu, Wu, Feichtenhofer et al. 2022

Preprint

Self Cite

194

210

The "Roaring 20s" of visual recognition began with the introduction of Vision Transformers (ViTs), which quickly superseded ConvNets as the state-of-the-art image classification model. A vanilla ViT, on the other hand, faces difficulties when applied to general computer vision tasks such as object detection and semantic segmentation. It is the hierarchical Transformers (e.g., Swin Transformers) that reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone and demonstrating remarkable performance on a wide variety of vision tasks. However, the effectiveness of such hybrid approaches is still largely credited to the intrinsic superiority of Transformers, rather than the inherent inductive biases of convolutions. In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually "modernize" a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.

“…4). We use the supervised training results from DeiT [68] for ViT-S/B and MAE [24] for ViT-L, as they employ improved training procedures over the original ViTs [18]. ConvNeXt models are trained with the same settings as before, but with longer warmup epochs.…”

Section: Isotropic ConvNeXt vs. ViT (mentioning)

confidence: 99%

“…However, BYOL (Grill et al, 2020) finds that when maximizing the similarity between two augmentations of one image, negative sample pairs are not necessary. Further, SimSiam (Chen and He, 2021) finds that momentum encoder is also not necessary while a stop-gradient operation applied on one side is enough for learning transferable representations.…”

Section: Jiang et al. (mentioning)

confidence: 99%

Transferability in Deep Learning: A Survey

Jiang, Yang, Wang et al. 2022

Preprint

The success of deep learning algorithms generally depends on large-scale data, while humans appear to have inherent ability of knowledge transfer, by recognizing and applying relevant knowledge from previous learning experiences when encountering and solving unseen tasks. Such an ability to acquire and reuse knowledge is known as transferability in deep learning. It has formed the long-term quest towards making deep learning as data-efficient as human learning, and has been motivating fruitful design of more powerful deep learning algorithms. We present this survey to connect different isolated areas in deep learning with their relation to transferability, and to provide a unified and complete view to investigating transferability through the whole lifecycle of deep learning. The survey elaborates the fundamental goals and challenges in parallel with the core principles and methods, covering recent cornerstones in deep architectures, pre-training, task adaptation and domain adaptation. This highlights unanswered questions on the appropriate objectives for learning transferable knowledge and for adapting the knowledge to new tasks and domains, avoiding catastrophic forgetting and negative transfer. Finally, we implement a benchmark and an open-source library, enabling a fair evaluation of deep learning methods in terms of transferability.

“…Autoencoding is a classical method for representation learning [25,46], which has been out-performed by contrastive learning approaches for years. However, the recent work in this line, He et al [18], has reclaimed state-of-the-art performance.…”

Section: Related Work (mentioning)

confidence: 99%

“…Substantial effort has been devoted to self-supervised learning methods for 2D images [6,9,25,34,51]. Among this line, autoencoder is one of the most classical methods [3,18,34,45,46]. Typically, it has an encoder that transforms the input into a latent code and a decoder that expands the latent code to reconstruct the input.…”

Section: Introduction (mentioning)

confidence: 99%

Implicit Autoencoder for Point Cloud Self-supervised Representation Learning

Shen, Yang, Li et al. 2022

Preprint

Many 3D representations (e.g., point clouds) are discrete samples of the underlying continuous 3D surface. This process inevitably introduces sampling variations on the underlying 3D shapes. In learning 3D representation, the variations should be disregarded while transferable knowledge of the underlying 3D shape should be captured. This becomes a grand challenge in existing representation learning paradigms. This paper studies autoencoding on point clouds. The standard autoencoding paradigm forces the encoder to capture such sampling variations as the decoder has to reconstruct the original point cloud that has sampling variations. We introduce Implicit Autoencoder(IAE), a simple yet effective method that addresses this challenge by replacing the point cloud decoder with an implicit decoder. The implicit decoder outputs a continuous representation that is shared among different point cloud sampling of the same model. Reconstructing under the implicit representation can prioritize that the encoder discards sampling variations, introducing more space to learn useful features. We theoretically justify this claim under a simple linear autoencoder. Moreover, the implicit decoder offers a rich space to design suitable implicit representations for different tasks. We demonstrate the usefulness of IAE across various self-supervised learning tasks for both 3D objects and 3D scenes. Experimental results show that IAE consistently outperforms the state-of-the-art in each task. Our code will be available at https://github.com/SimingYan/IAE.
