2021
DOI: 10.48550/arxiv.2111.14330
Preprint

Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity

Abstract: DETR is the first end-to-end object detector built on a transformer encoder-decoder architecture; it achieves competitive performance but low computational efficiency on high-resolution feature maps. The follow-up work, Deformable DETR, improves the efficiency of DETR by replacing dense attention with deformable attention, achieving 10× faster convergence and better performance. Deformable DETR uses multi-scale features to further improve performance; however, the number of encoder tokens increases by 20…
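The abstract's core idea — learnable sparsity over encoder tokens — can be illustrated with a small PyTorch sketch. This is a minimal, illustrative version of top-k token selection in the spirit of Sparse DETR, not the paper's exact architecture: the `TokenSelector` name, the linear scoring head, and the keep ratio are assumptions for exposition.

```python
import torch
import torch.nn as nn

class TokenSelector(nn.Module):
    """Scores flattened encoder tokens and keeps only the top-k.

    A minimal sketch of learnable token sparsification in the spirit of
    Sparse DETR; the scoring head and keep ratio here are illustrative.
    """
    def __init__(self, d_model: int, keep_ratio: float = 0.1):
        super().__init__()
        self.score_head = nn.Linear(d_model, 1)  # per-token saliency score
        self.keep_ratio = keep_ratio

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, num_tokens, d_model) — flattened multi-scale features
        scores = self.score_head(tokens).squeeze(-1)   # (batch, num_tokens)
        k = max(1, int(tokens.shape[1] * self.keep_ratio))
        topk = scores.topk(k, dim=1).indices           # indices of salient tokens
        idx = topk.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
        return tokens.gather(1, idx), topk             # (batch, k, d_model)

# Only the selected tokens are refined by the encoder, so cost shrinks
# roughly in proportion to keep_ratio.
x = torch.randn(2, 1000, 256)
selected, idx = TokenSelector(256)(x)
```

Because the scoring head is differentiable, it can be trained end-to-end to keep only the tokens that matter for detection.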

Cited by 18 publications (21 citation statements)
References 18 publications
“…Its main concept is to add an extra auxiliary head in the middle layers of the network, guiding the shallow network weights with an assistant loss. Even for architectures such as ResNet [26] and DenseNet [32], which usually converge well, deep supervision [70,98,67,47,82,65,86,50] can still significantly improve the model's performance on many tasks. Figure 5 (a) and (b) show, respectively, the object detector architecture "without" and "with" deep supervision.…”
Section: Coarse For Auxiliary And Fine For Lead Loss
Citation type: mentioning; confidence: 99%
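The deep-supervision idea quoted above can be sketched in a few lines: an auxiliary head attached to an intermediate layer contributes a down-weighted loss that guides the shallow weights. The class name, layer sizes, and the 0.4 auxiliary weight below are illustrative assumptions, not values from any of the cited works.

```python
import torch
import torch.nn as nn

class DeeplySupervisedNet(nn.Module):
    """Two-stage network with an auxiliary head on the middle layer.

    A minimal sketch of deep supervision; layer sizes are illustrative.
    """
    def __init__(self, in_dim=32, hidden=64, num_classes=10):
        super().__init__()
        self.stage1 = nn.Linear(in_dim, hidden)
        self.stage2 = nn.Linear(hidden, hidden)
        self.aux_head = nn.Linear(hidden, num_classes)   # auxiliary (coarse) head
        self.lead_head = nn.Linear(hidden, num_classes)  # final (lead) head

    def forward(self, x):
        h1 = torch.relu(self.stage1(x))       # shallow features
        h2 = torch.relu(self.stage2(h1))      # deep features
        return self.lead_head(h2), self.aux_head(h1)

# Training combines both losses; the auxiliary term is typically down-weighted.
model = DeeplySupervisedNet()
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
lead, aux = model(x)
loss = nn.functional.cross_entropy(lead, y) \
     + 0.4 * nn.functional.cross_entropy(aux, y)
```

At inference time the auxiliary head is simply discarded; only the lead head's predictions are used.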
“…The novelty of this study is to combine the merits of the most recent transformer-based works in CV, including the top-k object query [11], bounding box refinement and the two-stage strategy [9], and auxiliary losses in the encoder layers [12], to improve performance in terms of both accuracy and efficiency. This combination is integrated into the implementation of a transformer-based encoder-decoder detector that follows the structure of Deformable DETR.…”
Section: Transformer Detector
Citation type: mentioning; confidence: 99%
“…With the development of deep learning, extraordinary progress has been achieved in static-image object detection [3,18,38,43,63]. Existing object detectors can be mainly divided into three categories: two-stage [3,25,28,42,57], one-stage [43, 45, 49, 54-56, 64, 65], and query-based models [4,23,48,58,63,90]. For better performance, two-stage models generate a set of proposals and then refine the prediction results, like the R-CNN families [11,22,28,57].…”
Section: Introduction
Citation type: mentioning; confidence: 99%