2021
DOI: 10.48550/arxiv.2107.00641
Preprint

Focal Self-attention for Local-Global Interactions in Vision Transformers

Jianwei Yang,
Chunyuan Li,
Pengchuan Zhang
et al.

Abstract: Recently, Vision Transformer and its variants have shown great promise on various computer vision tasks. The ability to capture short- and long-range visual dependencies through self-attention is the key to their success. But it also brings challenges due to quadratic computational overhead, especially for high-resolution vision tasks (e.g., object detection). Many recent works have attempted to reduce the computational and memory cost and improve performance by applying either coarse-grained global attentions …

Cited by 63 publications (108 citation statements). References 55 publications (123 reference statements).
“…To address this problem, Swin Transformer restricts attention computation to a local window. Focal transformer (Yang et al, 2021) uses two-level windows to increase the ability of local attention methods to capture long-range connections. Pyramid vision transformer (PVT) (Wang et al, 2021c) reduces the computation of global attention methods by downsampling key and value tokens.…”
Section: Related Work
confidence: 99%
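The three strategies quoted above all tame the quadratic cost of global self-attention by shrinking what each query attends to. As one concrete illustration, the sketch below implements spatial-reduction attention in the spirit of PVT, where key/value tokens are downsampled by a strided convolution before a single global attention step; the class name, the `sr_ratio` parameter, and the layer choices are illustrative assumptions, not code from any of the cited papers.

```python
# Hedged sketch (not the cited papers' reference code): PVT-style spatial-reduction
# attention. Downsampling keys/values by a factor r reduces the attention map from
# N x N to N x (N / r^2) for N input tokens.
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    def __init__(self, dim, num_heads=8, sr_ratio=2):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)
        # strided conv downsamples the key/value token map by sr_ratio
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        # x: (B, N, C) with N = H * W tokens from an H x W feature map
        B, N, C = x.shape
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        # downsample tokens before computing keys and values
        x_ = x.transpose(1, 2).reshape(B, C, H, W)
        x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)   # (B, N / r^2, C)
        x_ = self.norm(x_)
        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, self.head_dim)
        k, v = kv.permute(2, 0, 3, 1, 4)                      # each (B, heads, N / r^2, head_dim)

        attn = (q @ k.transpose(-2, -1)) * self.head_dim ** -0.5
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# usage: a 14 x 14 feature map (196 tokens) of width 256
x = torch.randn(2, 196, 256)
y = SpatialReductionAttention(256, num_heads=8, sr_ratio=2)(x, 14, 14)
print(y.shape)  # torch.Size([2, 196, 256])
```

With an r x r reduction each query attends to N / r^2 keys instead of N, which is the saving the quoted passage attributes to PVT; Swin's window partitioning and Focal's two-level windows restrict the key set in a different, locality-based way.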
“…
Model                                  #Params (M)  FLOPs (G)  Top-1 (%)
…                                      13.6         2.3        80.0
ResNet50 (He et al, 2016)              25.1         4.1        76.4
ResNeXt50-32x4d (Xie et al, 2017)      25.0         4.3        77.6
RegNetY-4G (Radosavovic et al, 2020)   21.0         4.0        80.0
DeiT-Small/16 (Touvron et al, 2021)    22.1         4.6        79.9
T2T-ViTt-14 (Yuan et al, 2021b)        22.0         6.7        80.7
Swin-T                                 29.0         4.5        81.3
CvT-13                                 20.0         4.5        81.6
TNT-S                                  23.8         5.2        81.3
CoaT-Lite Small                        20.0         4.0        81.9
CeiT (Yuan et al, 2021a)               24.2         4.5        82.0
PVTv2-b2 (Wang et al, 2021c)           25.4         4.0        82.0
Focal-T (Yang et al, 2021)             29.1         4.9        82.2
QuadTree-B-b2 (ours)                   24.2         4.5        82.7
ResNet101 (He et al, 2016)             44.7         7.9        77.4
ResNeXt101-32x4d (Xie et al, 2017)     44.2         8.0        78.8
RegNetY-8G (Radosavovic et al, 2020)   39.0         8.0        81.7
CvT-21                                 32.0         7.1        82.5
PVTv2-b3 (Wang et al, 2021c)           45.2         6.9        83.2
Quadtree-B-b3 (ours)                   46.3         7.8        83.7
ResNet152 (He et al, 2016)             60.2         11.6       78.3
T2T-ViTt-24 (Yuan et al, 2021b)        64.0         15.0       82.2
Swin-S                                 50.0         8.7        83.0
Focal-Small (Yang et al, 2021)         51.1         9.1        83.5
PVTv2-b4 (Wang et al, 2021c)           62.6         10.1       83.6
Quadtree-B-b4 (ours)                   64.2         11.5       84.0
…”
Section: Image Classification
confidence: 99%
“…Lots of follow-up studies have been proposed [4,5,9,12,18,21,23,24,27-29,31,38,41,43,45,50,52,56,76,77,80,81,84]. Many of them analyze the ViT [15,17,26,32,44,55,69,73,75,82] and improve it by introducing locality to earlier layers [11,17,48,64,79,83,87]. In particular, Raghu et al [55] observe that the first few layers in ViTs focus on local information.…”
Section: Related Work
confidence: 99%
“…We report the performance on the validation subset, and use the mean average precision (AP) as the metric. We evaluate ELSA-Swin with Mask R-CNN / Cascade Mask R-CNN [2,33], which is a common practice in [6,70,71,79,87]. Following the common training protocol, we apply multi-scale training, scaling the shorter side of the input to between 480 and 800 while keeping the longer side no larger than 1333.…”
Section: Object Detection on COCO
confidence: 99%
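For readers who want to reproduce the resizing rule described in the quoted protocol (shorter side sampled between 480 and 800 pixels, longer side capped at 1333), a minimal sketch follows; the function name and the use of torchvision are assumptions made for illustration, not the cited papers' actual data pipeline.

```python
# Hedged sketch of the multi-scale resize rule described above: sample a target
# shorter side in [480, 800], then shrink the scale if the longer side would
# exceed 1333. Assumes a PIL image input.
import random
import torchvision.transforms.functional as F

def multiscale_resize(img, short_range=(480, 800), max_long=1333):
    w, h = img.size                         # PIL image size is (width, height)
    short, long = min(w, h), max(w, h)
    target_short = random.randint(*short_range)
    scale = target_short / short
    # cap the longer side at max_long pixels
    if long * scale > max_long:
        scale = max_long / long
    new_w, new_h = round(w * scale), round(h * scale)
    return F.resize(img, [new_h, new_w])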
“…The classification stage has a global receptive field, which enables implicit modeling of context between detections. Transformer-based methods [8,27,10] currently lead the COCO 2017 [4] object detection benchmark. These methods detect objects using cross-attention between the learned object queries and visual embedding keys, as well as self-attention between object queries to capture their interrelations in the scene context.…”
Section: Related Work
confidence: 99%
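The query/feature interaction described in the quoted passage follows the DETR-style decoder pattern: object queries first exchange information through self-attention, then cross-attend to the visual embeddings. The sketch below is a minimal, generic decoder layer illustrating that pattern; the layer names, dimensions, and post-norm arrangement are assumptions, not the implementation of any specific method on the benchmark.

```python
# Hedged sketch of a DETR-style decoder layer: self-attention among object queries
# models their interrelations, cross-attention lets each query read the visual
# embeddings (encoder/backbone tokens).
import torch
import torch.nn as nn

class QueryDecoderLayer(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, queries, memory):
        # queries: (B, num_queries, dim) learned object queries
        # memory:  (B, num_tokens, dim) visual embeddings
        q = self.norm1(queries + self.self_attn(queries, queries, queries)[0])
        q = self.norm2(q + self.cross_attn(q, memory, memory)[0])
        return self.norm3(q + self.ffn(q))

# usage: 100 object queries attending to a 20 x 20 feature map (400 tokens)
queries = torch.randn(2, 100, 256)
memory = torch.randn(2, 400, 256)
out = QueryDecoderLayer()(queries, memory)
print(out.shape)  # torch.Size([2, 100, 256])
```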