2021
DOI: 10.48550/arxiv.2112.01526
Preprint
MViTv2: Improved Multiscale Vision Transformers for Classification and Detection

Abstract: In this paper, we study Multiscale Vision Transformers (MViT) as a unified architecture for image and video classification, as well as object detection. We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections. We instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition where it outperforms prior work. We further compare MViTs' pooling attention to window atte…
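The two improvements named in the abstract can be sketched in a few lines. The following is a minimal, hypothetical NumPy illustration (the function names `pool_tokens`, `pooling_attention`, and `decomposed_rel_pos_bias` are ours, and 1-D average pooling over the token axis stands in for the conv-based spatial pooling the paper actually uses): the residual pooling connection adds the pooled query back onto the attention output, and the decomposed relative positional embedding replaces one table indexed by the full 2-D offset with separate height and width tables whose contributions are summed.

```python
import numpy as np

def pool_tokens(x, stride):
    # Strided average pooling over the token axis (a simplified 1-D
    # stand-in for MViT's conv-based spatial pooling of Q, K, V).
    L, C = x.shape
    return x[: L - L % stride].reshape(-1, stride, C).mean(axis=1)

def pooling_attention(x, stride_q=2, stride_kv=2):
    # Single-head pooling attention with the residual pooling
    # connection: the pooled query is added back to the output.
    q = pool_tokens(x, stride_q)
    k = pool_tokens(x, stride_kv)
    v = pool_tokens(x, stride_kv)
    scores = (q @ k.T) / np.sqrt(x.shape[1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v + q                            # residual pooling connection

def decomposed_rel_pos_bias(q, Rh, Rw, H, W):
    # Decomposed relative positional bias: two 1-D embedding tables,
    # Rh (2H-1 height offsets) and Rw (2W-1 width offsets), are combined
    # so that bias(i, j) = q_i . Rh[dh(i, j)] + q_i . Rw[dw(i, j)].
    dh = np.arange(H)[:, None] - np.arange(H)[None, :] + H - 1  # (H, H)
    dw = np.arange(W)[:, None] - np.arange(W)[None, :] + W - 1  # (W, W)
    qg = q.reshape(H, W, -1)
    bias_h = np.einsum('hwc,hjc->hwj', qg, Rh[dh])  # (H, W, H)
    bias_w = np.einsum('hwc,wjc->hwj', qg, Rw[dw])  # (H, W, W)
    return (bias_h[:, :, :, None] + bias_w[:, :, None, :]).reshape(H * W, H * W)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))
print(pooling_attention(x).shape)                      # (8, 8)

q = rng.standard_normal((4, 8))                        # 2x2 token grid
Rh = rng.standard_normal((3, 8))                       # 2*H - 1 offsets
Rw = rng.standard_normal((3, 8))
print(decomposed_rel_pos_bias(q, Rh, Rw, 2, 2).shape)  # (4, 4)
```

With strides of 2, the 16 input tokens are reduced to 8 output tokens, which is the resolution-reduction effect that makes MViT's multiscale pyramid possible; the decomposition keeps the positional table size linear in H and W rather than quadratic in the offset grid.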

Cited by 32 publications (61 citation statements)
References 54 publications
“…They have also been generalized from the image to video domain [3,22,51,54]. In this work, we build our architecture based on the Multiscale Vision Transformer (MViT) architecture [22,44] as a concrete instance, but the general idea can be applied to other ViT-based video models.…”
Section: Related Work (mentioning)
confidence: 99%
“…In this paper, we build MeMViT based on the MViT [22,44] architecture due to its strong performance, but the techniques presented in this paper can be applied to other ViT-based architectures. For completeness, we review ViT and MViT and introduce notations used in this paper next.…”
Section: Preliminaries (mentioning)
confidence: 99%