2022
DOI: 10.48550/arxiv.2205.13515
Preprint

Green Hierarchical Vision Transformer for Masked Image Modeling

Abstract: We present an efficient approach for Masked Image Modeling (MIM) with hierarchical Vision Transformers (ViTs), e.g., Swin Transformer [43], allowing the hierarchical ViTs to discard masked patches and operate only on the visible ones. Our approach consists of two key components. First, for the window attention, we design a Group Window Attention scheme following the Divide-and-Conquer strategy. To mitigate the quadratic complexity of the self-attention w.r.t. the number of patches, group attention encourages a…
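The abstract describes the core idea only at a high level: the encoder should never see masked patches. As a rough illustration of "discard masked patches and operate only on the visible ones", here is a minimal MAE-style random-masking sketch in PyTorch. The function name, shapes, and masking scheme are assumptions for illustration; the paper's actual contribution, the Group Window Attention scheme, is not reproduced here.

```python
import torch

# Hypothetical sketch of the idea in the abstract: drop the masked patches
# before encoding so the encoder only processes visible ones. Names and
# shapes are illustrative, not the paper's actual implementation.
def keep_visible_patches(patches: torch.Tensor, mask_ratio: float = 0.75):
    """patches: (B, N, D) patch embeddings. Returns visible patches and their indices."""
    B, N, D = patches.shape
    num_visible = int(N * (1 - mask_ratio))
    # A random permutation per sample decides which patches stay visible.
    noise = torch.rand(B, N, device=patches.device)
    ids_shuffle = noise.argsort(dim=1)
    ids_visible = ids_shuffle[:, :num_visible]               # (B, N_vis)
    visible = torch.gather(
        patches, 1, ids_visible.unsqueeze(-1).expand(-1, -1, D)
    )                                                         # (B, N_vis, D)
    return visible, ids_visible
```

With the default 75% mask ratio, the encoder sees only a quarter of the patches, which is where the efficiency gain comes from; handling the uneven visible-patch layout inside shifted windows is the part the paper's Group Window Attention addresses.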

Cited by 4 publications (20 citation statements)
References 50 publications
“…ConvMAE [39] presents a simple self-supervised learning framework with a block-wise masking strategy, which demonstrates that multi-scale features from supervised encoders can improve the performance of downstream tasks. The very recent approach Green-MAE [40] is similar to our approach, allowing the hierarchical models to discard masked patches and operate only on the visible ones. Our CoTMAE benefits from the development of hybrid convolutional-transformer pyramid networks and useful experience gained from recent works [34][35][36][37][38][39][40][41][42].…”
Section: Related Work
Mentioning confidence: 99%
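The statement above mentions ConvMAE's block-wise masking strategy only in passing. As a hedged sketch of what block-wise masking generally means (masked patches chosen as contiguous blocks on the patch grid rather than independently at random), consider the following; all names and parameters are illustrative assumptions, not ConvMAE's actual code:

```python
import torch

# Illustrative block-wise masking: mask contiguous square blocks of patches
# until roughly mask_ratio of the grid is covered. Parameters are assumed.
def blockwise_mask(grid: int = 14, mask_ratio: float = 0.75, block: int = 2):
    """Return a (grid*grid,) bool mask with ~mask_ratio of patches masked."""
    mask = torch.zeros(grid, grid, dtype=torch.bool)
    target = int(grid * grid * mask_ratio)
    while mask.sum() < target:
        # Mask one block x block square at a random grid location.
        r = torch.randint(0, grid - block + 1, (1,)).item()
        c = torch.randint(0, grid - block + 1, (1,)).item()
        mask[r:r + block, c:c + block] = True
    return mask.flatten()
```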
“…Despite superior performance on various downstream tasks, these models carry a huge computational burden and a slow learning process [31]. They typically require thousands of GPU hours of pre-training on ImageNet-1K to obtain generalizable representations.…”
Section: Introduction
Mentioning confidence: 99%
“…To this end, MAE [24] pioneered the asymmetric encoder-decoder strategy, where the costly encoder operates only on the few visible patches and the lightweight decoder takes all the patches as input for prediction. Further, GreenMIM [31] extends the asymmetric encoder-decoder strategy to hierarchical vision transformers (e.g., Swin [39]). Besides, [8,22,35] shrink the input resolution to reduce the number of input patches, thereby lowering the computational burden.…”
Section: Introduction
Mentioning confidence: 99%
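The asymmetric encoder-decoder strategy described in the statement above can be sketched roughly as follows: a heavy encoder sees only the visible patches, and a lightweight decoder reconstructs from all token positions after mask tokens are filled back in. Module sizes, names, and the use of generic transformer layers are assumptions for illustration, not MAE's or GreenMIM's actual architecture:

```python
import torch
import torch.nn as nn

# Minimal sketch of an asymmetric encoder-decoder, assuming generic
# transformer layers. The encoder runs on visible tokens only; the decoder
# runs on all positions, with a learnable mask token at masked positions.
class AsymmetricMAE(nn.Module):
    def __init__(self, dim=256, dec_dim=128, n_patches=196):
        super().__init__()
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True),
            num_layers=12)                       # costly: many layers, wide
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dec_dim, nhead=4, batch_first=True),
            num_layers=2)                        # lightweight: few layers
        self.proj = nn.Linear(dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.n_patches = n_patches

    def forward(self, visible: torch.Tensor, ids_visible: torch.Tensor):
        B = visible.size(0)
        latent = self.proj(self.encoder(visible))     # visible tokens only
        # Scatter encoded tokens back to their grid positions; fill the rest
        # with the mask token, then decode over the full set of positions.
        tokens = self.mask_token.expand(B, self.n_patches, -1).clone()
        tokens.scatter_(
            1, ids_visible.unsqueeze(-1).expand(-1, -1, latent.size(-1)),
            latent)
        return self.decoder(tokens)
```

The asymmetry is the point: the quadratic-cost encoder processes only the visible fraction of tokens, while the cheap decoder absorbs the full-length sequence needed for pixel reconstruction.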