HoughNet: Integrating Near and Long-Range Evidence for Bottom-Up Object Detection

Samet, Nermin; Hiçsönmez, Samet; Akbaş, Emre

doi:10.1007/978-3-030-58595-2_25

Cited by 42 publications (12 citation statements)

References 68 publications (133 reference statements)


“…In this paper, to overcome the limitations of top-down methods, we propose a bottom-up whole-body pose estimation method. Our approach is inspired from center-point based bottom-up object detection methods [60,46,13,31]. These methods can be easily extended to perform keypoint estimation task [60,47].…”

Section: Introduction (mentioning, confidence: 99%)

“…Our approach is inspired from center-point based bottom-up object detection methods [60,46,13,31]. These methods can be easily extended to perform keypoint estimation task [60,47]. For example, CenterNet [60], defines each keypoint with an offset to the center of the person instance and directly regresses them.…”

Section: Introduction (mentioning, confidence: 99%)

“…Later, detected keypoints are grouped and assigned to person instances. Recently, center-based object detection methods [60] have been extended to perform human pose estimation [60,47]. These methods represent keypoints with an offset value to the center of the person box and directly regresses them during training.…”

Section: Human Body Pose Estimation (mentioning, confidence: 99%)
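The center-offset keypoint representation described in the quotes above (each keypoint stored as an offset from the person center and regressed directly) can be illustrated with a minimal NumPy sketch. The function names `encode_keypoints` and `decode_keypoints` are hypothetical, and the sketch deliberately ignores the downsampled output grid and heatmap peak extraction that a real CenterNet-style detector uses.

```python
import numpy as np


def encode_keypoints(center, keypoints):
    """Express each keypoint as a (dx, dy) offset from the instance center.

    center: (cx, cy) of the person instance.
    keypoints: K rows of (x, y) absolute keypoint coordinates.
    Returns a (K, 2) array of offsets, the quantity a network would regress.
    """
    center = np.asarray(center, dtype=float)
    keypoints = np.asarray(keypoints, dtype=float)
    # Broadcasting subtracts the center from every keypoint row.
    return keypoints - center


def decode_keypoints(center, offsets):
    """Recover absolute keypoint coordinates from a detected center and regressed offsets."""
    return np.asarray(center, dtype=float) + np.asarray(offsets, dtype=float)


center = (50.0, 80.0)                      # hypothetical detected person center
keypoints = [(45.0, 40.0), (60.0, 120.0)]  # e.g. a head point and a knee point
offsets = encode_keypoints(center, keypoints)
print(offsets.tolist())                            # [[-5.0, -40.0], [10.0, 40.0]]
print(decode_keypoints(center, offsets).tolist())  # [[45.0, 40.0], [60.0, 120.0]]
```

Because every keypoint is tied to its center by construction, this representation needs no separate grouping step to assign keypoints to people.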

“…• For the optimization of the Person Center Heatmap and Body Keypoint Heatmap branches, we use the modified focal loss [33] as in [31,61,60,46]. We optimize the Person Center Offset, Body Keypoint Offset, Hand Keypoint Offset and Face Keypoint Offset branches using the L1 loss similar to the bottom-up object detectors [31,61,60,46]. Finally, for the Person Box H & W and Face Box H & W branches, we use L1 loss and scale it by 0.1 as in CenterNet [60].…”

Section: Network Architecture (mentioning, confidence: 99%)
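The loss recipe in the quote above can be sketched as follows. This assumes the penalty-reduced focal loss used by CenterNet, with its commonly used hyperparameters α = 2 and β = 4, and assumes the L1 terms are averaged over the regressed values at object centers; the 0.1 factor on the box height/width term follows the quote. The function names are hypothetical, NumPy arrays stand in for network outputs, and this is not the HPRNet code.

```python
import numpy as np


def modified_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-12):
    """Penalty-reduced focal loss over a heatmap (assumed CenterNet-style form).

    pred: predicted heatmap with values in (0, 1).
    gt: target heatmap; 1 at object centers, Gaussian-decayed values elsewhere.
    """
    pred = np.clip(pred, eps, 1.0 - eps)  # avoid log(0)
    pos = gt == 1.0
    neg = ~pos
    # Positive locations: down-weight easy, already-confident predictions.
    pos_loss = ((1.0 - pred) ** alpha * np.log(pred))[pos].sum()
    # Negative locations: (1 - gt)^beta reduces the penalty near true centers.
    neg_loss = ((1.0 - gt) ** beta * pred ** alpha * np.log(1.0 - pred))[neg].sum()
    num_pos = pos.sum()
    if num_pos == 0:
        return -neg_loss
    return -(pos_loss + neg_loss) / num_pos


def l1_loss(pred, target):
    """Mean absolute error over regressed values (normalization is an assumption)."""
    return np.abs(np.asarray(pred, float) - np.asarray(target, float)).mean()


def total_loss(hm_pred, hm_gt, off_pred, off_gt, size_pred, size_gt, size_weight=0.1):
    # Heatmap focal loss + offset L1 + box height/width L1 scaled by 0.1.
    return (modified_focal_loss(hm_pred, hm_gt)
            + l1_loss(off_pred, off_gt)
            + size_weight * l1_loss(size_pred, size_gt))


hm_gt = np.array([[0.0, 0.5, 0.0],
                  [0.5, 1.0, 0.5],
                  [0.0, 0.5, 0.0]])
good = np.where(hm_gt == 1.0, 0.9, 0.05)  # confident peak, low background
bad = np.where(hm_gt == 1.0, 0.1, 0.9)    # weak peak, high background
print(modified_focal_loss(good, hm_gt) < modified_focal_loss(bad, hm_gt))  # True
```

The (1 - gt)^β weight is what makes the loss "modified": locations just next to a true center are penalized only lightly for firing, instead of being treated as hard negatives.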


HPRNet: Hierarchical Point Regression for Whole-Body Human Pose Estimation

Samet, Akbaş

2021

Preprint

Self Cite


In this paper, we present a new bottom-up one-stage method for whole-body pose estimation, which we name "hierarchical point regression," or HPRNet for short, referring to the network that implements this method. To handle the scale variance among different body parts, we build a hierarchical point representation of body parts and jointly regress them. Unlike the existing two-stage methods, our method predicts whole-body pose in a constant time independent of the number of people in an image. On the COCO WholeBody dataset, HPRNet significantly outperforms all previous bottom-up methods on the keypoint detection of all whole-body parts (i.e. body, foot, face and hand); it also achieves state-of-the-art results in the face (75.4 AP) and hand (50.4 AP) keypoint detection. Code and models are available at https://github.com/nerminsamet/HPRNet.git.
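One way to read the "hierarchical point representation" in the abstract is that small parts (face, hands) are located relative to an intermediate part center, which is itself located relative to the person center, so each regressed offset stays short. The sketch below decodes such a two-level hierarchy. The hierarchy layout, the part names, and `decode_hierarchy` are assumptions for illustration only, not a description of HPRNet's actual implementation.

```python
import numpy as np


def decode_hierarchy(person_center, part_center_offsets, part_keypoint_offsets):
    """Decode a hypothetical two-level point hierarchy into absolute coordinates.

    person_center: (cx, cy) of the detected person.
    part_center_offsets: part name -> (dx, dy) from the person center to the part center.
    part_keypoint_offsets: part name -> K rows of (dx, dy) from the part center to each keypoint.
    Returns part name -> (K, 2) array of absolute keypoint coordinates.
    """
    pc = np.asarray(person_center, dtype=float)
    decoded = {}
    for name, center_offset in part_center_offsets.items():
        # Level 1: long-range offset from the person center to the part center.
        part_center = pc + np.asarray(center_offset, dtype=float)
        # Level 2: short-range offsets from the part center to its keypoints.
        decoded[name] = part_center + np.asarray(part_keypoint_offsets[name], dtype=float)
    return decoded


out = decode_hierarchy(
    person_center=(100.0, 100.0),
    part_center_offsets={"face": (0.0, -60.0), "left_hand": (-40.0, 10.0)},
    part_keypoint_offsets={"face": [(-5.0, 0.0), (5.0, 0.0)], "left_hand": [(0.0, -3.0)]},
)
print(out["face"].tolist())       # [[95.0, 40.0], [105.0, 40.0]]
print(out["left_hand"].tolist())  # [[60.0, 107.0]]
```

The design motivation, under this reading, is scale: a few-pixel error matters far more for a tiny hand keypoint than for a hip, and short offsets from a nearby part center are easier to regress precisely than long offsets from the person center.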



“…Corner representation A bounding box can be determined by two points, e.g., a top-left corner and a bottom-right corner. Some approaches [30,15,16,7,21,39,26] first detect these individual points and then compose bounding boxes from them. We refer to these representation methods as corner representation.…”

Section: Introduction (mentioning, confidence: 99%)
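The corner representation described in the quote above (detect the two corner points, then compose a box from them) is sketched below in plain Python. It only enumerates geometrically valid corner pairs; real corner-based detectors add a learned signal to decide which corners belong to the same object (CornerNet, for example, uses associative embeddings), so the brute-force pairing and the function names here are simplifications for illustration.

```python
from itertools import product


def compose_boxes(top_lefts, bottom_rights):
    """Pair every top-left corner with every bottom-right corner and keep the
    geometrically valid pairs as (x1, y1, x2, y2) boxes.

    Brute-force enumeration only; no learned grouping signal is used.
    """
    boxes = []
    for (x1, y1), (x2, y2) in product(top_lefts, bottom_rights):
        # A bottom-right corner must lie strictly below and to the right.
        if x2 > x1 and y2 > y1:
            boxes.append((x1, y1, x2, y2))
    return boxes


def to_center_size(box):
    """Convert a corner-format box to (cx, cy, w, h)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)


boxes = compose_boxes([(10, 10), (50, 60)], [(40, 30), (90, 100)])
print(boxes)                     # [(10, 10, 40, 30), (10, 10, 90, 100), (50, 60, 90, 100)]
print(to_center_size(boxes[0]))  # (25.0, 20.0, 30, 20)
```

The quadratic number of candidate pairs is exactly why corner-based methods need an extra grouping cue: geometry alone leaves ambiguous pairs such as the second box above.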

RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder

Chi, Wei

2020

Preprint


Existing object detection frameworks are usually built on a single format of object/part representation, i.e., anchor/proposal rectangle boxes in RetinaNet and Faster R-CNN, center points in FCOS and RepPoints, and corner points in CornerNet. While these different representations usually drive the frameworks to perform well in different aspects, e.g., better classification or finer localization, it is in general difficult to combine these representations in a single framework to make good use of each strength, due to the heterogeneous or non-grid feature extraction by different representations. This paper presents an attention-based decoder module similar to that in the Transformer [31] to bridge other representations into a typical object detector built on a single representation format, in an end-to-end fashion. The other representations act as a set of key instances to strengthen the main query representation features in the vanilla detectors. Novel techniques are proposed towards efficient computation of the decoder module, including a key sampling approach and a shared location embedding approach. The proposed module is named bridging visual representations (BVR). It works in place, and we demonstrate its broad effectiveness in bridging other representations into prevalent object detection frameworks, including RetinaNet, Faster R-CNN, FCOS and ATSS, where improvements of about 1.5 to 3.0 AP are achieved. In particular, we improve a state-of-the-art framework with a strong backbone by about 2.0 AP, reaching 52.7 AP on COCO test-dev. The resulting network is named RelationNet++. The code will be available at https://github.com/microsoft/RelationNet2.
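The abstract describes other representations acting as key instances that strengthen the main query features through an attention-based decoder. As a rough illustration only, the sketch below applies single-head scaled dot-product attention from query features (e.g. anchor-box features) to key features (e.g. corner-point features) and adds the result back residually. The distance-based geometry penalty, the random projection matrices, and `bridge_features` are hypothetical simplifications; the key sampling and shared location embedding named in the abstract are not modeled, so this is not the BVR module itself.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row max for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def bridge_features(query_feats, query_pos, key_feats, key_pos,
                    w_q, w_k, w_v, pos_scale=1.0):
    """Strengthen query features with attended key features (hypothetical sketch).

    query_feats: (Nq, D) features of the main representation.
    query_pos:   (Nq, 2) query locations.
    key_feats:   (Nk, D) features of the auxiliary representation.
    key_pos:     (Nk, 2) key locations.
    w_q, w_k, w_v: (D, D) projection matrices.
    Returns (Nq, D) features: input queries plus a residual attention term.
    """
    q = query_feats @ w_q
    k = key_feats @ w_k
    v = key_feats @ w_v
    # Appearance similarity, scaled as in standard Transformer attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    # Hypothetical geometry term: keys far from a query receive lower scores.
    dist = np.linalg.norm(query_pos[:, None, :] - key_pos[None, :, :], axis=-1)
    attn = softmax(scores - pos_scale * dist, axis=-1)  # (Nq, Nk), rows sum to 1
    return query_feats + attn @ v


rng = np.random.default_rng(0)
D, Nq, Nk = 8, 3, 5
qf, kf = rng.normal(size=(Nq, D)), rng.normal(size=(Nk, D))
qp, kp = rng.uniform(0, 10, size=(Nq, 2)), rng.uniform(0, 10, size=(Nk, 2))
w_q, w_k, w_v = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))
out = bridge_features(qf, qp, kf, kp, w_q, w_k, w_v)
print(out.shape)  # (3, 8)
```

The residual addition means the original detector features pass through unchanged when the attended term contributes nothing, which is one way to read the abstract's framing of the module as strengthening, rather than replacing, the main representation.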
