2023
DOI: 10.1109/tpami.2022.3202765
P2T: Pyramid Pooling Transformer for Scene Understanding

Cited by 132 publications (50 citation statements)
References 58 publications
“…VGG16 [31], ResNet50 [32], and MobileNetv2 [50]) and Transformer backbones (e.g. Swin Transformer [24] and P2T [38]) to evaluate the effectiveness of our proposed asymmetric architecture. From the quantitative comparison results in Table 2, we can see that, in both symmetric and asymmetric two‐stream structures, Transformer‐based methods (PVTv2, Swin, and P2T) generally achieve superior performance compared to pure CNN‐based methods, which can be attributed to the powerful ability of the Transformer to model long‐range dependencies.…”
Section: Methods
confidence: 99%
“…The success of Vision Transformer [23] in the field of image recognition has led to the widespread application of Transformer models in computer vision tasks. Various Transformer backbones following the hierarchical structure of VGG [36] and ResNet [37] have emerged in succession, such as Swin Transformer [24], PVT [25, 26], and P2T [38]. These Transformer‐based methods have achieved state‐of‐the‐art performance in a variety of computer vision tasks such as detection [39] and segmentation [40], demonstrating the great potential of the Transformer.…”
Section: Related Work
confidence: 99%
“…Method                    Backbone   Crop Size  mIoU (SS)  mIoU (MS)
PVTv1 [39]                 PVTv1-L    512 × 512  44.8       -
PVTv2 [40]                 PVTv2-B5   512 × 512  48.7       -
P2T [41]                   P2T-L      512 × 512  49.4       -
Swin-UperNet [42,47]       Swin-L †   640 × 640  -          53.5
FaPN-MaskFormer [10,48]    Swin-L †   640 × 640  55.2       56.7
BEiT-UperNet [4,47]        BEiT       …

Since one of the main objectives of the proposed auxiliary CNN is to improve segmentation performance in complex scenes, which include small objects and require detailed local information for accurate segmentation, we show several qualitative results for the ADE20K and Cityscapes datasets. Figure 2 presents the qualitative results on the ADE20K validation dataset.…”
Section: Backbone, Crop Size, mIoU (SS), mIoU (MS)
confidence: 99%
“…PVT [39, 40] borrows the pyramid structure concept from CNNs and designs a pyramid vision transformer for learning multi-scale features at high resolutions. P2T [41] implements a pooling-based self-attention module with depthwise convolutional operations for multi-scale feature learning. Ref.…”
Section: Related Work
confidence: 99%
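
The mechanism the statement above attributes to P2T (pooling-based self-attention with depthwise convolutions) can be summarized in a short PyTorch sketch. This is a minimal illustration, not the authors' implementation: the class name PyramidPoolingAttention, the pool output sizes (1, 2, 3, 6), and the use of nn.MultiheadAttention are assumptions made for clarity. The idea it demonstrates matches the quote: keys and values come from a concatenation of multi-scale pooled token maps, each refined by a depthwise convolution, so attention sees multi-scale context over a much shorter sequence.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PyramidPoolingAttention(nn.Module):
        # Illustrative sketch of pooling-based self-attention in the spirit of P2T.
        def __init__(self, dim, num_heads=4, pool_sizes=(1, 2, 3, 6)):
            super().__init__()
            self.pool_sizes = pool_sizes
            # One depthwise conv per pooled map (hypothetical refinement step).
            self.dw_convs = nn.ModuleList(
                nn.Conv2d(dim, dim, 3, padding=1, groups=dim) for _ in pool_sizes
            )
            self.norm = nn.LayerNorm(dim)
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, x, h, w):
            # x: (B, N, C) tokens of an h x w feature map.
            b, n, c = x.shape
            feat = x.transpose(1, 2).reshape(b, c, h, w)
            pooled = []
            for size, dw in zip(self.pool_sizes, self.dw_convs):
                p = F.adaptive_avg_pool2d(feat, size)   # (B, C, size, size)
                p = p + dw(p)                           # depthwise refinement
                pooled.append(p.flatten(2))             # (B, C, size*size)
            # Keys/values: concatenated multi-scale tokens, much shorter than N.
            kv = self.norm(torch.cat(pooled, dim=2).transpose(1, 2))
            out, _ = self.attn(x, kv, kv)               # queries keep full length
            return out

    # Usage: a 14 x 14 feature map with 64 channels.
    attn = PyramidPoolingAttention(dim=64)
    tokens = torch.randn(2, 14 * 14, 64)
    print(attn(tokens, 14, 14).shape)  # torch.Size([2, 196, 64])

With the assumed pool sizes (1, 2, 3, 6), the key/value sequence has 1 + 4 + 9 + 36 = 50 tokens regardless of input resolution, which is where the efficiency gain over full self-attention comes from.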
“…The original ViT [5] is a plain, non-hierarchical architecture. Various hierarchical transformers, such as [17-21], have been presented since. These methods inherit some designs from convolution-based networks, such as hierarchical structures, pooling, and down-sampling with convolutions.…”
Section: Related Work
confidence: 99%
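
A minimal sketch of the CNN-inspired design this last statement refers to: a strided convolution merges tokens between stages of a hierarchical transformer, halving resolution and widening channels. The class name ConvDownsample and the dimensions are hypothetical; this illustrates the general pattern, not any one backbone's exact layer.

    import torch
    import torch.nn as nn

    class ConvDownsample(nn.Module):
        # Strided-conv "patch merging" between stages: halve H and W, widen C.
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.proj = nn.Conv2d(in_dim, out_dim, kernel_size=3, stride=2, padding=1)
            self.norm = nn.LayerNorm(out_dim)

        def forward(self, x):
            # x: (B, C, H, W) -> tokens (B, (H/2)*(W/2), C') plus the new size.
            x = self.proj(x)
            b, c, h, w = x.shape
            return self.norm(x.flatten(2).transpose(1, 2)), (h, w)

    # Usage: stage 1 -> stage 2 of a four-stage hierarchy.
    down = ConvDownsample(64, 128)
    tokens, (h, w) = down(torch.randn(2, 64, 56, 56))
    print(tokens.shape, h, w)  # torch.Size([2, 784, 128]) 28 28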