Generalized Focal Loss V2: Learning Reliable Localization Quality Estimation for Dense Object Detection

Li, Xiang; Wang, Wenhai; Hu, Xiaolin; Li, Jun; Tang, Jinhui; Yang, Jian

doi:10.1109/cvpr46437.2021.01146

Cited by 428 publications

(419 citation statements)

References 24 publications

Supporting

Mentioning

417

Contrasting

Unclassified


“…For fair comparison between PVTv2 and Swin Transformer [21], we keep all settings the same, including ImageNet-1K pre-training and COCO fine-tuning strategies. We evaluate Swin Transformer and PVTv2 on four state-of-the-art detectors, including Cascade R-CNN [1], ATSS [37], GFL [17], and Sparse R-CNN [26]. We see PVTv2 obtain much better AP than Swin Transformer among all the detectors, showing its better feature representation ability.…”

Section: Object Detection (mentioning)

confidence: 94%

“…All models are trained on COCO train2017 (118k images) and evaluated on val2017 (5k images). We verify the effectiveness of PVTv2 backbones on top of mainstream detectors, including RetinaNet [19], Mask R-CNN [11], Cascade Mask R-CNN [1], ATSS [37], GFL [17], and Sparse R-CNN [26].…”

Section: Object Detection (mentioning)

confidence: 99%

“…Specifically, PVTv2-B5 yields 83.8% top-1 accuracy on ImageNet, which is significantly better than Swin-B [21] and Twins-SVT-L [4], while PVTv2-B5 has fewer parameters and GFLOPS. Moreover, GFL [17] with PVT-B2 achieves 50.2 AP on COCO val2017, 2.6 AP higher than the one with Swin-T [21], 5.7 AP higher than the one with ResNet50 [12]. We hope these improved baselines will provide a reference for future research in vision Transformer.…”

Section: Introduction (mentioning)

confidence: 95%


PVT v2: Improved Baselines with Pyramid Vision Transformer

Wang, Xie, et al. 2021

Preprint

Self Cite


Transformers have recently shown encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (abbreviated as PVTv1) with three designs: (1) overlapping patch embedding, (2) convolutional feedforward networks, and (3) linear complexity attention layers. With these modifications, our PVTv2 significantly improves on PVTv1 on three tasks: classification, detection, and segmentation. Moreover, PVTv2 achieves comparable or better performance than recent works such as Swin Transformer. We hope this work will facilitate state-of-the-art Transformer research in computer vision. Code is available at https://github.com/whai362/PVT.


“…DETR [5] adopts a three-layer perceptron to predict object box coordinates. However, as pointed out by GFLoss [28], directly regressing the coordinates is equivalent to fitting a Dirac delta distribution, which fails to consider the ambiguity and uncertainty in the datasets. This representation is not flexible and not robust to challenges such as occlusion and cluttered background in object tracking.…”

Section: A Simple Baseline Based On Transformer (mentioning)

confidence: 99%
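
The Dirac delta argument in the statement above can be illustrated with a small sketch. It assumes that a single box-side offset is predicted as a discrete probability distribution over a fixed set of bins and decoded as the expectation of that distribution, which is the general idea behind the distribution-based box representation in the Generalized Focal Loss line of work. The bin range (0 to 7), the function names, and the example logits are hypothetical choices made for illustration, not values from any of the cited papers.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def expected_offset(logits, bin_values):
    """Decode an offset as the expectation of a discrete distribution.

    Returns the decoded offset and the underlying probabilities.
    """
    probs = softmax(logits)
    return float(np.dot(probs, bin_values)), probs


# Hypothetical discretization: candidate offsets 0, 1, ..., 7.
bins = np.arange(0, 8, dtype=float)

# Case 1: almost all mass on one bin (close to a Dirac delta at 3).
sharp_logits = np.full(8, -20.0)
sharp_logits[3] = 20.0

# Case 2: mass split evenly between bins 2 and 4 (an ambiguous boundary).
ambiguous_logits = np.full(8, -20.0)
ambiguous_logits[[2, 4]] = 20.0

sharp_offset, sharp_probs = expected_offset(sharp_logits, bins)
ambiguous_offset, ambiguous_probs = expected_offset(ambiguous_logits, bins)

# Both cases decode to roughly the same offset, but the distributions differ:
# the peak probability is a simple cue for how certain the localization is.
print(sharp_offset, sharp_probs.max())
print(ambiguous_offset, ambiguous_probs.max())
```

A direct regressor would output only the single decoded number in both cases, so the difference between a confident and an ambiguous boundary would be lost. Keeping the full distribution preserves that information.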

Learning Spatio-Temporal Transformer for Visual Tracking

Yan, Peng, Fu, et al. 2021

Preprint


In this paper, we present a new tracking architecture with an encoder-decoder transformer as the key component. The encoder models the global spatio-temporal feature dependencies between target objects and search regions, while the decoder learns a query embedding to predict the spatial positions of the target objects. Our method casts object tracking as a direct bounding box prediction problem, without using any proposals or predefined anchors. With the encoder-decoder transformer, the prediction of objects just uses a simple fully-convolutional network, which estimates the corners of objects directly. The whole method is end-to-end and does not need any postprocessing steps such as cosine window and bounding box smoothing, thus largely simplifying existing tracking pipelines. The proposed tracker achieves state-of-the-art performance on five challenging short-term and long-term benchmarks, while running at real-time speed, being 6× faster than Siam R-CNN [47]. Code and models are open-sourced here.


“…The lowest evaluation scheme layer is an alternative scheme for decision-making, where seven test algorithms are selected, Faster RCNN [38], RetinaNet [39], ATSS [40], FoveaBox [41], GFocal Loss [42], PAFPN [43], and RepPoints [44], representing seven schemes. The middle layer is the criterion that needs to be considered in the evaluation process, namely, four evaluation indexes: illumination index, environment index, scale index, and angle index.…”

Section: Construction Of Hierarchical Structure (mentioning)

confidence: 99%

Benchmarking the Robustness of Object Detection Based on Near-Real Military Scenes

Zhang, Fang, et al. 2022

Wireless Communications and Mobile Computing


According to the technical requirements of intelligent development of auxiliary combat system, we construct a visual intelligent test platform. A near-real military scene dataset based on physical rendering is built, which contains 11,000 remote sensing images collected by an analog camera taking pictures in different illumination, weather environment, camera shooting angle, and scene scale condition. Besides, we add a natural style transfer module for a single unmodeled military scene image’s multienvironment generation. We conduct experiments to evaluate the stability of several UAV remote sensing image object detection algorithms. Based on the quality and speed value of the tested algorithms, the adaptability scores in different environments are calculated. Furthermore, we propose a comprehensive evaluation index system of military remote object detection based on a hierarchical model. We envision that our comprehensive benchmark will play a role in the evaluation of algorithm capability for military object detection tasks and the improvement of training algorithm capability.
