Touvron et al. [13] improve the training strategy of ViT and propose a knowledge distillation method, which helps ViT achieve performance comparable to that of CNNs when trained only on ImageNet. Subsequently, various works explore efficient vision Transformer architecture designs, e.g., PVT [16,35], Swin [14,60], Twins [15], MViT [17,37], and others [61,36,62,38,39,63,26,64]. Transformers also demonstrate their superiority on various tasks, e.g., object detection [11,23], segmentation [21,24,20,65], pose estimation [22], tracking [19,66], and GANs [25,67].