2021 IEEE/CVF International Conference on Computer Vision (ICCV)
DOI: 10.1109/iccv48922.2021.00061
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions

Cited by 2,980 publications (1,357 citation statements). References 41 publications.
“…ResNet [13] is the most widely used convolutional model while RegNet [35] is a family of carefully designed CNN models. We also compare with recent hierarchical vision transformers PVT [44] and Swin [27]. Benefiting from the log-linear complexity, GFNet-H models show significantly better performance than ResNet, RegNet and PVT and achieve similar performance with Swin while having a much simpler and more generic design.…”
Section: ImageNet Results (mentioning)
Confidence: 99%
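The log-linear complexity mentioned in this statement comes from GFNet's FFT-based token mixing: tokens are transformed to the frequency domain, multiplied element-wise by a learnable global filter, and transformed back, so mixing N tokens costs O(N log N) rather than the O(N^2) of self-attention. Below is a minimal PyTorch-style sketch of such a global filter layer; the class name, default grid size, and initialization are illustrative assumptions, not the authors' exact implementation.

import torch
import torch.nn as nn

class GlobalFilter(nn.Module):
    """Token mixing in the frequency domain: FFT -> learnable filter -> inverse FFT.
    The 2D FFT over the token grid costs O(N log N), which is the log-linear
    complexity the quotation refers to."""
    def __init__(self, dim, h=14, w=14):
        super().__init__()
        self.h, self.w = h, w
        # rfft2 keeps w//2 + 1 frequencies along the last spatial axis;
        # the filter is complex-valued, stored as (real, imag) pairs.
        self.filter = nn.Parameter(torch.randn(h, w // 2 + 1, dim, 2) * 0.02)

    def forward(self, x):
        # x: (batch, h, w, channels) grid of tokens
        X = torch.fft.rfft2(x, dim=(1, 2), norm='ortho')       # to frequency domain
        X = X * torch.view_as_complex(self.filter)              # element-wise global filter
        return torch.fft.irfft2(X, s=(self.h, self.w), dim=(1, 2), norm='ortho')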
“…Then, we obtain 3 variants of the model (GFNet-Ti, GFNet-S and GFNet-B) by simply adjusting the depth and embedding dimension, which have similar computational costs with ResNet-18, 50 and 101 [13]. For hierarchical models, we also design three models (GFNet-H-Ti, GFNet-H-S and GFNet-H-B) that have these three levels of complexity following the design of PVT [44]. We use 4 × 4 patch embedding to form the input tokens and use a non-overlapping convolution layer to downsample tokens following [44,27].…”
Section: Model (mentioning)
Confidence: 99%
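For reference, the hierarchical stem this statement describes (a 4 × 4 non-overlapping patch embedding for the input tokens, with a non-overlapping convolution downsampling tokens between stages, following PVT [44] and Swin [27]) can be sketched roughly as below; the class names, channel widths, and stage count are illustrative assumptions rather than the exact GFNet-H configuration.

import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split the image into non-overlapping 4x4 patches and project each to `dim` channels."""
    def __init__(self, in_chans=3, dim=96, patch_size=4):
        super().__init__()
        # kernel_size == stride -> non-overlapping patches
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)

    def forward(self, x):            # x: (B, 3, H, W)
        return self.proj(x)          # (B, dim, H/4, W/4)

class Downsample(nn.Module):
    """Non-overlapping 2x2 convolution that halves the token grid and doubles the channels."""
    def __init__(self, dim):
        super().__init__()
        self.reduce = nn.Conv2d(dim, dim * 2, kernel_size=2, stride=2)

    def forward(self, x):            # x: (B, dim, H, W)
        return self.reduce(x)        # (B, 2*dim, H/2, W/2)

# Example: four-stage token grids at 1/4, 1/8, 1/16 and 1/32 of the input resolution.
img = torch.randn(1, 3, 224, 224)
x = PatchEmbed()(img)                # (1, 96, 56, 56)
for _ in range(3):
    x = Downsample(x.shape[1])(x)    # channels 96 -> 192 -> 384 -> 768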