2021
DOI: 10.48550/arxiv.2111.15193
Preprint
Shunted Self-Attention via Multi-Scale Token Aggregation

Abstract: Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks, thanks to their competence in modeling long-range dependencies of image patches or tokens via self-attention. These models, however, usually designate similar receptive fields for each token feature within each layer. Such a constraint inevitably limits the ability of each self-attention layer to capture multi-scale features, thereby leading to performance degradation in handling images with mu…

Cited by 6 publications (6 citation statements)
References 29 publications
“…Here, the local context enhancement term LCE(V) is introduced [23]. The function LCE(·) is parameterized by a depth-wise convolution with a kernel size of 5.…”
Section: Bi-level Routing Spatial Attention Module
confidence: 99%
“…These matrices are then combined with the LCE(·) function [14] and applied to attention. Notably, the LCE(·)…”
Section: Figure 3 Structure Of C3, C2f and Fast_C2f
confidence: 99%
“…The Softmax function completes the attention calculation on the aggregated key-value pairs. In addition, the local context enhancement (LCE) [60] function is introduced to improve the context representation of each value via a depth-wise separable convolution whose kernel size is set to 5. The output is expressed as follows:…”
Section: BRGAtt-based Object Detector Network
confidence: 99%
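The citation statements above all describe the same mechanism: a local context enhancement term LCE(V), implemented as a per-channel (depth-wise) convolution with kernel size 5 over the value tensor, whose result is added to the attention output. A minimal NumPy sketch of that idea is given below; the function name `lce` and the `(C, H, W)` tensor layout are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def lce(v, kernels, k=5):
    """Local context enhancement sketch: depth-wise k x k convolution.

    v       : value tensor of shape (C, H, W)
    kernels : one (k, k) filter per channel, shape (C, k, k)
    Each channel is convolved with its own filter (no cross-channel
    mixing), matching the depth-wise convolution the citations describe.
    """
    C, H, W = v.shape
    pad = k // 2
    # Zero-pad the spatial dimensions so the output keeps shape (C, H, W).
    vp = np.pad(v, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(v)
    for c in range(C):                 # one independent filter per channel
        for i in range(H):
            for j in range(W):
                out[c, i, j] = np.sum(vp[c, i:i + k, j:j + k] * kernels[c])
    return out

# Per the quoted statements, LCE augments the attention result:
#   output = softmax(Q K^T / sqrt(d)) V  +  LCE(V)
```

As a sanity check, a filter with a single 1 at its center acts as the identity, so `lce(v, identity_kernels)` returns `v` unchanged.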