2021
DOI: 10.48550/arxiv.2111.14330
Preprint

Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity

Abstract: DETR is the first end-to-end object detector built on a transformer encoder-decoder architecture; it achieves competitive performance but low computational efficiency on high-resolution feature maps. The follow-up work, Deformable DETR, improves the efficiency of DETR by replacing dense attention with deformable attention, achieving 10× faster convergence and better performance. Deformable DETR uses multi-scale features to further improve performance; however, the number of encoder tokens increases by 20…
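The abstract's core idea — learnable sparsity over encoder tokens — can be illustrated with a small PyTorch sketch. This is a minimal, illustrative version of top-k token selection in the spirit of Sparse DETR, not the paper's exact architecture: the `TokenSelector` name, the linear scoring head, and the keep ratio are assumptions for exposition.

```python
import torch
import torch.nn as nn

class TokenSelector(nn.Module):
    """Scores flattened encoder tokens and keeps only the top-k.

    A minimal sketch of learnable token sparsification in the spirit of
    Sparse DETR; the scoring head and keep ratio here are illustrative.
    """
    def __init__(self, d_model: int, keep_ratio: float = 0.1):
        super().__init__()
        self.score_head = nn.Linear(d_model, 1)  # per-token saliency score
        self.keep_ratio = keep_ratio

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, num_tokens, d_model) — flattened multi-scale features
        scores = self.score_head(tokens).squeeze(-1)   # (batch, num_tokens)
        k = max(1, int(tokens.shape[1] * self.keep_ratio))
        topk = scores.topk(k, dim=1).indices           # indices of salient tokens
        idx = topk.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])
        return tokens.gather(1, idx), topk             # (batch, k, d_model)

# Only the selected tokens are refined by the encoder, so cost shrinks
# roughly in proportion to keep_ratio.
x = torch.randn(2, 1000, 256)
selected, idx = TokenSelector(256)(x)
```

Because the scoring head is differentiable, it can be trained end-to-end to keep only the tokens that matter for detection.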

Cited by 18 publications (21 citation statements)
References 18 publications
“…Its main concept is to add an extra auxiliary head in the middle layers of the network, guiding the shallow network weights with an assistant loss. Even for architectures such as ResNet [26] and DenseNet [32], which usually converge well, deep supervision [70,98,67,47,82,65,86,50] can still significantly improve the model's performance on many tasks. Figure 5 (a) and (b) show, respectively, the object detector architecture "without" and "with" deep supervision.…”
Section: Coarse For Auxiliary And Fine For Lead Loss
Citation type: mentioning; confidence: 99%
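The deep-supervision idea quoted above can be sketched in a few lines: an auxiliary head attached to an intermediate layer contributes a down-weighted loss that guides the shallow weights. The class name, layer sizes, and the 0.4 auxiliary weight below are illustrative assumptions, not values from any of the cited works.

```python
import torch
import torch.nn as nn

class DeeplySupervisedNet(nn.Module):
    """Two-stage network with an auxiliary head on the middle layer.

    A minimal sketch of deep supervision; layer sizes are illustrative.
    """
    def __init__(self, in_dim=32, hidden=64, num_classes=10):
        super().__init__()
        self.stage1 = nn.Linear(in_dim, hidden)
        self.stage2 = nn.Linear(hidden, hidden)
        self.aux_head = nn.Linear(hidden, num_classes)   # auxiliary (coarse) head
        self.lead_head = nn.Linear(hidden, num_classes)  # final (lead) head

    def forward(self, x):
        h1 = torch.relu(self.stage1(x))       # shallow features
        h2 = torch.relu(self.stage2(h1))      # deep features
        return self.lead_head(h2), self.aux_head(h1)

# Training combines both losses; the auxiliary term is typically down-weighted.
model = DeeplySupervisedNet()
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
lead, aux = model(x)
loss = nn.functional.cross_entropy(lead, y) \
     + 0.4 * nn.functional.cross_entropy(aux, y)
```

At inference time the auxiliary head is simply discarded; only the lead head's predictions are used.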
“…The novelty of this study is to combine the merits of the most recent transformer-based works in CV, including the top-k object query [11], bounding box refinement and the two-stage strategy [9], and auxiliary losses in the encoder layers [12], to improve performance in terms of both accuracy and efficiency. This combination is integrated into the implementation of a transformer-based encoder-decoder detector that follows the structure of Deformable DETR.…”
Section: Transformer Detector
Citation type: mentioning; confidence: 99%
“…With the development of deep learning, extraordinary progress has been achieved in static-image object detection [3,18,38,43,63]. Existing object detectors can be mainly divided into three categories: two-stage [3,25,28,42,57], one-stage [43, 45, 49, 54-56, 64, 65], and query-based models [4,23,48,58,63,90]. For better performance, two-stage models generate a set of proposals and then refine the prediction results, like the R-CNN families [11,22,28,57].…”
Section: Introduction
Citation type: mentioning; confidence: 99%