2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.00061
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

Cited by 2,980 publications (1,357 citation statements). References 41 publications.
“…ResNet [13] is the most widely used convolutional model while RegNet [35] is a family of carefully designed CNN models. We also compare with recent hierarchical vision transformers PVT [44] and Swin [27]. Benefiting from the log-linear complexity, GFNet-H models show significantly better performance than ResNet, RegNet and PVT and achieve similar performance with Swin while having a much simpler and more generic design.…”
Section: ImageNet Results (mentioning)
Confidence: 99%
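The log-linear complexity mentioned in this statement comes from GFNet's FFT-based token mixing: tokens are transformed to the frequency domain, multiplied element-wise by a learnable global filter, and transformed back, so mixing N tokens costs O(N log N) rather than the O(N^2) of self-attention. Below is a minimal PyTorch-style sketch of such a global filter layer; the class name, default grid size, and initialization are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn

class GlobalFilter(nn.Module):
    """Token mixing in the frequency domain: FFT -> learnable filter -> inverse FFT.
    The 2D FFT over the token grid costs O(N log N), which is the log-linear
    complexity the quotation refers to."""
    def __init__(self, dim, h=14, w=14):
        super().__init__()
        self.h, self.w = h, w
        # rfft2 keeps w//2 + 1 frequencies along the last spatial axis;
        # the filter is complex-valued, stored as (real, imag) pairs.
        self.filter = nn.Parameter(torch.randn(h, w // 2 + 1, dim, 2) * 0.02)

    def forward(self, x):
        # x: (batch, h, w, channels) grid of tokens
        X = torch.fft.rfft2(x, dim=(1, 2), norm='ortho')       # to frequency domain
        X = X * torch.view_as_complex(self.filter)              # element-wise global filter
        return torch.fft.irfft2(X, s=(self.h, self.w), dim=(1, 2), norm='ortho')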
“…Then, we obtain 3 variants of the model (GFNet-Ti, GFNet-S and GFNet-B) by simply adjusting the depth and embedding dimension, which have similar computational costs with ResNet-18, 50 and 101 [13]. For hierarchical models, we also design three models (GFNet-H-Ti, GFNet-H-S and GFNet-H-B) that have these three levels of complexity following the design of PVT [44]. We use 4 × 4 patch embedding to form the input tokens and use a non-overlapping convolution layer to downsample tokens following [44,27].…”
Section: Model (mentioning)
Confidence: 99%
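For reference, the hierarchical stem this statement describes (a 4 × 4 non-overlapping patch embedding for the input tokens, with a non-overlapping convolution downsampling tokens between stages, following PVT [44] and Swin [27]) can be sketched roughly as below; the class names, channel widths, and stage count are illustrative assumptions rather than the exact GFNet-H configuration.

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split the image into non-overlapping 4x4 patches and project each to `dim` channels."""
    def __init__(self, in_chans=3, dim=96, patch_size=4):
        super().__init__()
        # kernel_size == stride -> non-overlapping patches
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):            # x: (B, 3, H, W)
        return self.proj(x)          # (B, dim, H/4, W/4)

class Downsample(nn.Module):
    """Non-overlapping 2x2 convolution that halves the token grid and doubles the channels."""
    def __init__(self, dim):
        super().__init__()
        self.reduce = nn.Conv2d(dim, dim * 2, kernel_size=2, stride=2)

    def forward(self, x):            # x: (B, dim, H, W)
        return self.reduce(x)        # (B, 2*dim, H/2, W/2)

# Example: four-stage token grids at 1/4, 1/8, 1/16 and 1/32 of the input resolution.
img = torch.randn(1, 3, 224, 224)
x = PatchEmbed()(img)                # (1, 96, 56, 56)
for _ in range(3):
    x = Downsample(x.shape[1])(x)    # channels 96 -> 192 -> 384 -> 768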