Sparse R-CNN: End-to-End Object Detection with Learnable Proposals

Sun, Peize; Zhang, Rufeng; Jiang, Yi; Kong, Tao; Xu, Chenfeng; Zhan, Wei; Tomizuka, Masayoshi; Li, Lei; Yuan, Zehuan; Wang, Changhu; Luo, Ping

doi:10.48550/arxiv.2011.12450

Cited by 46 publications

(114 citation statements)

References 45 publications

Mentioning: 114

“…This include the vanilla DETR [3] method improved with 300 queries, reference points, and focal loss as described by [41] and the Deformable DETR [41]. We also presented the reported performance of RCNN-based methods [2,18,27,29,30,33] and other DETR variants [6,8,23,36,39]. From the results in Table 1, we can observe that our method consistently improves different R50-based baseline methods by around 2 points in AP using 50 epochs.…”

Section: Comparison With Different DETR Methods (mentioning)

confidence: 80%

“…Despite effectiveness, such type of RoI-based refinement methodology can not be directly applied to the fully end-to-end pipeline of DETR because they rely on different optimization goals and still require NMS. More recently, some methods, like Efficient DETR [39], TSP-RCNN [30], and SparseRCNN [29], also uses RoIs to achieve improved performance with a Transformer and can also avoid the NMS. However, we argue that these methods are still based on the typical two-stage detection pipeline like Faster RCNN [27] and they only apply Transformer mainly to approximate NMS.…”

Section: Improvement Of Transformer In Computer Vision (mentioning)

confidence: 99%

Recurrent Glimpse-based Decoder for Detection with Transformer

Chen, Zhang, Dacheng

2021

Preprint

Although detection with Transformer (DETR) is increasingly popular, its global attention modeling requires an extremely long training period to optimize and achieve promising detection performance. Alternative to existing studies that mainly develop advanced feature or embedding designs to tackle the training issue, we point out that the Region-of-Interest (RoI) based detection refinement can easily help mitigate the difficulty of training for DETR methods. Based on this, we introduce a novel REcurrent Glimpse-based decOder (REGO) in this paper. In particular, the REGO employs a multi-stage recurrent processing structure to help the attention of DETR gradually focus on foreground objects more accurately. In each processing stage, visual features are extracted as glimpse features from RoIs with enlarged bounding box areas of detection results from the previous stage. Then, a glimpse-based decoder is introduced to provide refined detection results based on both the glimpse features and the attention modeling outputs of the previous stage. In practice, REGO can be easily embedded in representative DETR variants while maintaining their fully end-to-end training and inference pipelines. In particular, REGO helps Deformable DETR achieve 44.8 AP on the MSCOCO dataset with only 36 training epochs, compared with the first DETR and the Deformable DETR that require 500 and 50 epochs to achieve comparable performance, respectively. Experiments also show that REGO consistently boosts the performance of different DETR detectors by up to 7% relative gain at the same setting of 50 training epochs. Code is available via https://github.com/zhechen/Deformable-DETR-REGO.
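The glimpse step described in this abstract takes RoIs from enlarged bounding boxes of the previous stage's detections. A minimal Python sketch of such an enlargement follows. The function name `enlarge_box`, the `scale` parameter, and the clipping to image bounds are assumptions made for illustration; the abstract does not state the enlargement factor or how image borders are handled.

```python
def enlarge_box(box, scale, img_w, img_h):
    """Scale an (x1, y1, x2, y2) box about its center, then clip to the image.

    `scale` is a hypothetical enlargement factor (assumption, not from the paper).
    """
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w = (x2 - x1) * scale / 2
    half_h = (y2 - y1) * scale / 2
    return (
        max(0, cx - half_w),
        max(0, cy - half_h),
        min(img_w, cx + half_w),
        min(img_h, cy + half_h),
    )


# Boxes predicted by a previous stage (made-up values for illustration).
previous_stage_boxes = [(10, 10, 20, 20), (0, 0, 10, 10)]
glimpse_rois = [enlarge_box(b, 2.0, 100, 100) for b in previous_stage_boxes]
```

Enlarging about the box center keeps the original detection inside the glimpse region while adding surrounding context; the second box shows the clipping at the image border.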

“…The initial learning rate is 0.01, and it decreases with the factor 0.1 in the 8th and 11th epoch. We choose Faster RCNN with FPN and Sparse RCNN [37] for comparison.…”

Section: Methods (mentioning)

confidence: 99%

P2P-Loc: Point to Point Tiny Person Localization

Yu, Wu, Ye et al. 2021

Preprint
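The learning-rate schedule quoted in the citation statement above (initial rate 0.01, multiplied by 0.1 at the 8th and 11th epoch) can be sketched as a step function. This is an illustration only: the function name `step_lr`, the 1-indexed epoch convention, and applying the drop at the start of each milestone epoch are assumptions, since the quoted text does not specify them.

```python
def step_lr(epoch, base_lr=0.01, milestones=(8, 11), gamma=0.1):
    """Learning rate for a 1-indexed epoch under a step decay schedule.

    Assumption: the rate is multiplied by `gamma` once for every
    milestone epoch that has been reached.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


# Rate used in each of 12 epochs (12 is an arbitrary length for the demo).
schedule = [step_lr(e) for e in range(1, 13)]
```

Under these assumptions, epochs 1 to 7 use 0.01, epochs 8 to 10 use 0.001, and epoch 11 onward uses 0.0001.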

“…Anchor-free approaches [15,40,49] replace the hand-crafted anchors by reference points. Recently, end-to-end detectors [10,37,50] remove the hand-crafted anchors and non-maximum suppression via bipartite matching. The implicit feature refinement introduced in this paper can be used to refine the instance features of one-stage object detectors as well.…”

Section: Object Detection (mentioning)

confidence: 99%

Implicit Feature Refinement for Instance Segmentation

Wang, Dong et al. 2021

Proceedings of the 29th ACM International Conference on Multimedia

We propose a novel implicit feature refinement module for high-quality instance segmentation. Existing image/video instance segmentation methods rely on explicitly stacked convolutions to refine instance features before the final prediction. In this paper, we first give an empirical comparison of different refinement strategies, which reveals that the widely-used four consecutive convolutions are not necessary. As an alternative, weight-sharing convolution blocks provide competitive performance. When such a block is iterated infinitely many times, the block output will eventually converge to an equilibrium state. Based on this observation, the implicit feature refinement (IFR) is developed by constructing an implicit function. The equilibrium state of instance features can be obtained by fixed-point iteration via a simulated infinite-depth network. Our IFR enjoys several advantages: 1) simulates an infinite-depth refinement network while only requiring parameters of a single residual block; 2) produces high-level equilibrium instance features of global receptive field; 3) serves as a plug-and-play general module easily extended to most object recognition frameworks. Experiments on the COCO and YouTube-VIS benchmarks show that our IFR achieves improved performance on state-of-the-art image/video instance segmentation frameworks, while reducing the parameter burden (e.g. 1% AP improvement on Mask R-CNN with only 30.0% parameters in mask head). Code is made available at https://github.com/lufanma/IFR.git.
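The idea of reaching an equilibrium state by fixed-point iteration of a weight-shared block can be illustrated with a toy example. This is not the IFR module: the element-wise map `tanh(weight * z + x)` is a stand-in chosen because it is a contraction when `|weight| < 1`, and the names `fixed_point` and `make_block` are hypothetical.

```python
import math


def fixed_point(f, z0, tol=1e-10, max_iter=1000):
    """Iterate z <- f(z) until successive iterates differ by less than tol."""
    z = list(z0)
    for step in range(1, max_iter + 1):
        z_next = f(z)
        if max(abs(a - b) for a, b in zip(z_next, z)) < tol:
            return z_next, step
        z = z_next
    raise RuntimeError("fixed-point iteration did not converge")


def make_block(x, weight=0.5):
    """Toy weight-shared block: the same `weight` is reused at every step.

    `x` plays the role of the input features injected at each iteration.
    """
    def block(z):
        return [math.tanh(weight * zi + xi) for zi, xi in zip(z, x)]
    return block


x = [0.2, -1.0, 0.7]          # made-up input features
block = make_block(x)
z_star, steps = fixed_point(block, [0.0] * len(x))
```

Applying `block` once more to `z_star` leaves it essentially unchanged, which is what the equilibrium state means here: repeating the same block any further no longer changes the features.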
