2022
DOI: 10.1007/s41095-022-0274-8

PVT v2: Improved baselines with Pyramid Vision Transformer

Abstract: Transformers have recently led to encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) by adding three designs: (i) a linear complexity attention layer, (ii) an overlapping patch embedding, and (iii) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linearity and provides significant improvements on fundamental vision tasks such as classification, dete…
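The three designs named in the abstract can be illustrated with a minimal PyTorch sketch (not the authors' code; layer sizes and names are illustrative assumptions): overlapping patch embedding via a strided convolution whose stride is smaller than its kernel, a convolutional feed-forward network with a depth-wise 3x3 convolution, and a linear-complexity attention layer that pools keys/values to a fixed grid.

```python
# Illustrative sketch of the three PVT v2 designs (assumed shapes/sizes, not the authors' code).
import torch
import torch.nn as nn

class OverlappingPatchEmbed(nn.Module):
    """Design (ii): patches overlap because the conv stride is smaller than its kernel."""
    def __init__(self, in_ch=3, embed_dim=64, patch_size=7, stride=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch_size,
                              stride=stride, padding=patch_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                        # x: (B, C, H, W)
        x = self.proj(x)                         # (B, D, H/stride, W/stride)
        B, D, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)         # (B, N, D) token sequence
        return self.norm(x), H, W

class ConvFFN(nn.Module):
    """Design (iii): a 3x3 depth-wise conv inserted between the two linear layers."""
    def __init__(self, dim=64, hidden=256):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, H, W):                  # x: (B, N, D)
        x = self.fc1(x)
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)
        x = self.dwconv(x).flatten(2).transpose(1, 2)
        return self.fc2(self.act(x))

class LinearSRAttention(nn.Module):
    """Design (i): keys/values are average-pooled to a fixed small grid, so the
    attention cost grows linearly with the number of query tokens."""
    def __init__(self, dim=64, num_heads=1, pool_size=7):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(pool_size)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):                  # x: (B, N, D)
        B, N, C = x.shape
        kv = x.transpose(1, 2).reshape(B, C, H, W)
        kv = self.pool(kv).flatten(2).transpose(1, 2)   # (B, pool_size^2, D)
        out, _ = self.attn(x, kv, kv)
        return out
```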


Cited by 1,028 publications (410 citation statements); references 34 publications.
“…Touvron et al [13] improve the training strategy of ViT and propose a knowledge distillation method, which helps ViT achieve performance comparable with CNNs trained only on ImageNet. Then various works endeavor to explore efficient vision Transformer architecture designs, e.g., PVT [16,35], Swin [14,60], Twins [15], MViT [17,37], and others [61,36,62,38,39,63,26,64]. Transformer also presents its superiority on various tasks, e.g., object detection [11,23], segmentation [21,24,20,65], pose estimation [22,22], tracking [19,66] and GAN [25,67].…”
Section: Self-attention Based Models
confidence: 99%
“…CNN-based architectures [1,3,2,7,8,10,9] locally mix tokens within a shifted window with the fixed shape. Transformer-based architectures [12,13,14,35,16,15,36,37,17,38,39] perform message passing from other tokens into the query token based on the calculated pairwise attention weights, depending on the affinities between tokens in the embedding space. MLP-based architectures mostly enable information interaction through spatial fully connections across all tokens [28,30,40,34] or across certain tokens selected with hand-crafted rules in a deterministic manner [31,33,41,32,29,42].…”
Section: Introduction
confidence: 99%
“…3.2, we propose to implement the probabilistic encoder with a Pyramid Transformer architecture tailored for time-series data and the probabilistic decoder with a simple Transformer model without upsampling operations. The neural architecture design is heavily inspired by recent advances in Transformer models [3,6,30,42,43] to reduce the potentially prohibitive cost of verifying building components and hyperparameter tuning while maintaining the model as simple as possible, which is beneficial for demonstrating the effectiveness of the proposed modifications to the vanilla VAE objective. The probabilistic encoder is adapted from the recently proposed Pyramid Vision Transformer (PVT) [42,43], which is composed of the overlapping patch embedding, conditional position encoding (CPE), multi-head self-attention (MHSA), and feed-forward (FFD) layers combined as the basic building unit, as illustrated in Fig.…”
Section: Model Instantiation
confidence: 99%
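The citation above describes a basic building unit made of overlapping patch embedding, conditional position encoding (CPE), multi-head self-attention (MHSA), and feed-forward (FFD) layers. A minimal sketch of how such a unit could be composed is shown below; it is an assumption-based illustration (class name, dimensions, and residual layout are not taken from the cited paper), with CPE realized as a zero-padded depth-wise convolution over the 2D token grid.

```python
# Illustrative composition of a CPE + MHSA + FFD building unit (assumed layout).
import torch
import torch.nn as nn

class PyramidBlock(nn.Module):
    def __init__(self, dim=64, num_heads=4, ffn_hidden=256):
        super().__init__()
        self.cpe = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)   # conditional position encoding
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_hidden), nn.GELU(),
                                 nn.Linear(ffn_hidden, dim))

    def forward(self, x, H, W):                  # x: (B, N, D) tokens on an H x W grid
        B, N, C = x.shape
        # CPE: positional information comes from the zero-padded depth-wise conv
        grid = x.transpose(1, 2).reshape(B, C, H, W)
        x = x + self.cpe(grid).flatten(2).transpose(1, 2)
        h = self.norm1(x)
        x = x + self.attn(h, h, h)[0]            # MHSA with residual connection
        x = x + self.ffn(self.norm2(x))          # FFD with residual connection
        return x
```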
“…Therefore, it is necessary to develop highly modularized neural architectures given the widespread success of Transformers [26,41], thereby reducing the burden of researchers caused by manual architecture design from scratch. This motivates us to adapt the successful Pyramid Vision Transformer model [42,43] proposed for computer vision tasks to process time-series data. Surprisingly, this simple model can yield superior performance compared to the existing state-of-the-art models while only operating on the temporal dimension without the need to devise tailored spatial components.…”
confidence: 99%
“…However, it can still not harness the semantic context during finetuning due to the relatively smaller size of the dataset and a change in the number and nature of semantic classes from classification to the segmentation task. Hierarchical vision transformers [47,48] tackle the problem with progressive downsampling of features along the stages, although they still lack the semantic context of the image.…”
Section: Introduction
confidence: 99%