2021 | Preprint
DOI: 10.48550/arXiv.2104.11227

Multiscale Vision Transformers

Abstract: We present Multiscale Vision Transformers (MViT) for video and image recognition, by connecting the seminal idea of multiscale feature hierarchies with transformer models. Multiscale Transformers have several channel-resolution scale stages. Starting from the input resolution and a small channel dimension, the stages hierarchically expand the channel capacity while reducing the spatial resolution. This creates a multiscale pyramid of features, with early layers operating at high spatial resolution to model simple low-level visual information, and deeper layers at spatially coarse but complex, high-dimensional features…
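To make the stage design in the abstract concrete, here is a minimal sketch in PyTorch of the core idea: each stage expands the channel dimension while a strided pooling inside attention reduces the spatial resolution, producing the multiscale pyramid. The names (PoolAttention, MultiscaleStage), the max-pooling choice, and the stage configuration are illustrative assumptions, not the authors' implementation (MViT's actual pooling attention also pools keys/values and carries residual connections).

```python
# Illustrative sketch only: pooled-query attention that halves the token grid
# while a wider projection expands channels, per the abstract's description.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PoolAttention(nn.Module):
    """Self-attention whose query path is pooled with a stride, so the
    output sequence has lower spatial resolution than the input."""

    def __init__(self, dim_in, dim_out, num_heads=2, q_stride=2):
        super().__init__()
        self.num_heads = num_heads
        self.q_stride = q_stride
        self.qkv = nn.Linear(dim_in, dim_out * 3)  # channel expansion happens here
        self.proj = nn.Linear(dim_out, dim_out)

    def forward(self, x, hw):
        # x: (B, H*W, C_in); hw: current (H, W) of the token grid.
        B, N, _ = x.shape
        H, W = hw
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        C = q.shape[-1]

        # Pool queries on the 2D grid: this is what reduces resolution.
        q = q.transpose(1, 2).reshape(B, C, H, W)
        q = F.max_pool2d(q, kernel_size=self.q_stride, stride=self.q_stride)
        Ho, Wo = q.shape[-2:]
        q = q.flatten(2).transpose(1, 2)  # (B, Ho*Wo, C)

        def split_heads(t):
            return t.reshape(B, -1, self.num_heads, C // self.num_heads).transpose(1, 2)

        q, k, v = map(split_heads, (q, k, v))
        attn = (q @ k.transpose(-2, -1)) * (q.shape[-1] ** -0.5)
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, Ho * Wo, C)
        return self.proj(out), (Ho, Wo)


class MultiscaleStage(nn.Module):
    """One channel-resolution scale stage: expand channels, halve resolution."""

    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.norm = nn.LayerNorm(dim_in)
        self.attn = PoolAttention(dim_in, dim_out)

    def forward(self, x, hw):
        return self.attn(self.norm(x), hw)


# Toy forward pass: channels grow 96 -> 192 -> 384 while the token grid
# shrinks 32x32 -> 16x16 -> 8x8, forming the multiscale pyramid.
if __name__ == "__main__":
    x, hw = torch.randn(1, 32 * 32, 96), (32, 32)
    for dim_in, dim_out in [(96, 192), (192, 384)]:
        x, hw = MultiscaleStage(dim_in, dim_out)(x, hw)
        print(hw, x.shape)  # (16, 16) (1, 256, 192), then (8, 8) (1, 64, 384)
```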

Cited by 69 publications (123 citation statements) | References 104 publications

“…Along this line of research, the main focus is to improve the attention mechanism so that it better matches the intrinsic properties of visual signals. For example, MViT (Fan et al. 2021) builds hierarchical attention layers to obtain multi-scale features. Swin Transformer (Liu et al. 2021b) introduces a locality constraint into its attention mechanism.…”
Section: Related Work (Attention and Vision Transformers)
Mentioning confidence: 99%
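For readers unfamiliar with the locality constraint this excerpt attributes to Swin Transformer, here is a minimal sketch: self-attention computed only within non-overlapping windows of the token grid, rather than over all tokens. The helper name window_attention and the q = k = v simplification (no learned projections, no shifted windows, no relative position bias) are assumptions for brevity, not Swin's actual implementation.

```python
# Illustrative window-restricted self-attention: tokens attend only within
# their own window x window patch of the grid.
import torch
import torch.nn.functional as F

def window_attention(x, window=4):
    # x: (B, H, W, C) token grid; H and W assumed divisible by `window`.
    B, H, W, C = x.shape
    nH, nW = H // window, W // window
    x = x.reshape(B, nH, window, nW, window, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window * window, C)  # (B*nW*nH, w*w, C)
    attn = (x @ x.transpose(-2, -1)) * (C ** -0.5)  # q = k = v = x for brevity
    out = attn.softmax(dim=-1) @ x  # each token mixes only with its window
    out = out.reshape(B, nH, nW, window, window, C)
    return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

x = torch.randn(2, 8, 8, 32)
print(window_attention(x).shape)  # torch.Size([2, 8, 8, 32])
```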
“…Recently, the pioneering work ViT [22] successfully applied a pure transformer-based architecture to computer vision, revealing the potential of transformers for handling visual tasks. Many follow-up studies have been proposed [4,5,9,12,18,21,23,24,27-29,31,38,41,43,45,50,52,56,76,77,80,81,84]. Many of them analyze ViT [15,17,26,32,44,55,69,73,75,82] and improve it by introducing locality into earlier layers [11,17,48,64,79,83,87].…”
Section: Related Work
Mentioning confidence: 99%
“…Many attempts have been made to integrate long-range modeling into CNNs, such as non-local networks [41,48] and relation networks [21]. Vision Transformer (ViT) [12] first introduced a set of pure Transformer backbones for image classification, and its follow-ups soon modified the vision transformer to dominate many downstream computer-vision tasks, such as object detection [6,53], semantic segmentation [26], action recognition [3,14], 2D/3D human pose estimation [47,52], and 3D object detection [31]. It has shown great potential as an alternative backbone to convolutional neural networks.…”
Section: Related Work
Mentioning confidence: 99%
“…The vision transformer has achieved stunning success in computer vision since ViT [12]. It has shown impressive capability over convolutional neural networks (CNNs) across prevalent visual domains, including image classification [9,39], object detection [5,53], semantic segmentation [26], and action recognition [3,14], under both supervised and self-supervised [1,19] training configurations. Along with the development of ViT models, the deployment of vi…”
Section: Introduction
Mentioning confidence: 99%