2023
DOI: 10.48550/arxiv.2303.08810
Preprint

BiFormer: Vision Transformer with Bi-Level Routing Attention

Abstract: As the core building block of vision transformers, attention is a powerful tool to capture long-range dependency. However, such power comes at a cost: it incurs a huge computation burden and heavy memory footprint as pairwise token interaction across all spatial locations is computed. A series of works attempt to alleviate this problem by introducing handcrafted and content-agnostic sparsity into attention, such as restricting the attention operation to be inside local windows, axial stripes, or dilated window…
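The bi-level routing attention proposed in the paper can be summarized as two steps: a coarse region-to-region routing step that keeps only the top-k most relevant regions for each query region, followed by ordinary token-to-token attention restricted to the tokens gathered from those regions. The following is a minimal, illustrative PyTorch sketch of that idea, not the official implementation; the tensor layout, the region partition, the default sizes, and the omission of learned Q/K/V projections and multi-head splitting are all simplifying assumptions.

```python
import torch


def bi_level_routing_attention(x, num_regions_per_side=7, top_k=4):
    """Illustrative sketch of bi-level routing attention (not the official code).

    x: (B, H, W, C) feature map. H and W are assumed divisible by
    num_regions_per_side; learned Q/K/V projections and multi-head
    splitting are omitted for brevity.
    """
    B, H, W, C = x.shape
    S = num_regions_per_side
    h, w = H // S, W // S                      # tokens per region along each axis
    n = h * w                                  # tokens per region

    # Partition the feature map into S*S non-overlapping regions of n tokens each.
    regions = x.view(B, S, h, S, w, C).permute(0, 1, 3, 2, 4, 5)
    regions = regions.reshape(B, S * S, n, C)  # (B, R, n, C) with R = S*S

    q = k = v = regions                        # identity projections for this sketch

    # --- Level 1: coarse region-to-region routing ----------------------------
    # Region descriptors are mean-pooled queries/keys; their affinity matrix
    # decides which top_k regions each query region is allowed to attend to.
    q_region = q.mean(dim=2)                                   # (B, R, C)
    k_region = k.mean(dim=2)                                   # (B, R, C)
    affinity = q_region @ k_region.transpose(-1, -2)           # (B, R, R)
    topk_idx = affinity.topk(top_k, dim=-1).indices            # (B, R, top_k)

    # Gather the key/value tokens of the routed regions for every query region.
    idx = topk_idx[..., None, None].expand(-1, -1, -1, n, C)   # (B, R, top_k, n, C)
    k_gather = torch.gather(k.unsqueeze(1).expand(-1, S * S, -1, -1, -1), 2, idx)
    v_gather = torch.gather(v.unsqueeze(1).expand(-1, S * S, -1, -1, -1), 2, idx)
    k_gather = k_gather.reshape(B, S * S, top_k * n, C)
    v_gather = v_gather.reshape(B, S * S, top_k * n, C)

    # --- Level 2: fine-grained token-to-token attention -----------------------
    attn = torch.softmax(q @ k_gather.transpose(-1, -2) / C ** 0.5, dim=-1)
    out = attn @ v_gather                      # (B, R, n, C)

    # Undo the region partition.
    out = out.view(B, S, S, h, w, C).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
    return out


# Tiny smoke test with assumed sizes.
feat = torch.randn(2, 28, 28, 64)
print(bi_level_routing_attention(feat).shape)  # torch.Size([2, 28, 28, 64])
```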

Cited by 30 publications (33 citation statements); references 38 publications.
“…Concerning the model architecture, bi-level routing attention (BiLRA) [47] and a new detection head were integrated to improve the detection model's performance on small cell targets; the structure of BiLRA is shown in Fig. S3.…”
Section: Model Optimization (mentioning)
confidence: 99%
“…In this paper, inspired by the application of BiFormer [27] in vision, a two-level routing attention mechanism is used to filter out most of the irrelevant key-value pairs at the coarse-grained region level and then apply token-to-token attention to the small set of relevant tokens that remain, which provides good performance and computational efficiency because attention is not spread over irrelevant tokens. We propose to use stacked multi-head self-attention to construct an inference layer.…”
Section: Text Detection and Recognition Model (mentioning)
confidence: 99%
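The efficiency claim in the statement above is easy to quantify with back-of-the-envelope arithmetic: if the N tokens are split into R regions and only the top-k routed regions are kept per query region, each query compares against k·N/R keys instead of N. The sizes in the snippet below are illustrative assumptions, not values taken from the cited paper or from any particular BiFormer configuration.

```python
# How many key-value pairs each query sees before and after coarse routing.
N = 56 * 56          # tokens in the feature map (assumed size)
R = 7 * 7            # coarse regions (assumed partition)
top_k = 4            # routed regions kept per query region (assumed)

kv_full = N                     # keys per query in vanilla attention
kv_routed = top_k * (N // R)    # keys per query after routing
print(kv_full, kv_routed, kv_full / kv_routed)   # 3136 256 12.25
```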
“…To improve the detection performance of small targets in remote sensing images, this paper proposes an improved algorithm based on YOLOv5s-6.0, called YOLOv5s-RSD. The main improvements are summarized in three points: (1) using SPD-Conv [14] to replace the original downsampling method to reduce the loss of detailed information during downsampling; (2) adding a small-target detection head to fully utilize small-target features, and recalculating the optimal pre-selected (anchor) boxes using K-means; (3) introducing BiFormer [15] attention to enhance the focus on useful features.…”
Section: Introduction (mentioning)
confidence: 99%
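The second improvement quoted above, recalculating the pre-selected (anchor) boxes with K-means, is a standard YOLO-family step: cluster the width/height pairs of the training boxes and use the cluster centres as anchor priors. Below is a minimal NumPy sketch of that idea; the synthetic box data, the anchor count, and the plain Euclidean distance (YOLO tooling often uses a 1 − IoU distance instead) are assumptions for illustration, not the cited authors' script.

```python
import numpy as np


def kmeans_anchors(box_wh, num_anchors=9, iters=100, seed=0):
    """Cluster (width, height) pairs of ground-truth boxes into anchor priors.

    box_wh: (N, 2) array of box widths and heights, e.g. in pixels.
    Plain Euclidean K-means is used here for simplicity.
    """
    rng = np.random.default_rng(seed)
    centers = box_wh[rng.choice(len(box_wh), num_anchors, replace=False)]
    for _ in range(iters):
        # Assign every box to its nearest anchor centre.
        dists = np.linalg.norm(box_wh[:, None, :] - centers[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        # Move each centre to the mean of its assigned boxes.
        new_centers = np.array([
            box_wh[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
            for j in range(num_anchors)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Sort anchors by area, as YOLO configs usually list them smallest first.
    return centers[np.argsort(centers.prod(axis=1))]


# Synthetic example: 500 random boxes standing in for a real label set.
wh = np.random.default_rng(1).uniform(8, 320, size=(500, 2))
print(kmeans_anchors(wh, num_anchors=9).round(1))
```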