2018
DOI: 10.1007/978-3-030-01240-3_21

DetNet: Design Backbone for Object Detection

Abstract: Recent CNN-based object detectors, whether one-stage methods like YOLO [1,2], SSD [3], and RetinaNet [4] or two-stage detectors like Faster R-CNN [5], R-FCN [6], and FPN [7], usually fine-tune directly from ImageNet pre-trained models designed for image classification. There has been little work on a backbone feature extractor specifically designed for object detection. More importantly, there are several differences between the tasks of image classification and object detection.…


Cited by 362 publications (242 citation statements) · References 44 publications
“…DetNet [36] uses 1×1 convolution projection instead of identity mapping although stages 4, 5, and 6 have the same spatial resolution. Our results ( Figure 5 Right) imply that the design keeps stages 4 and 5 away from the output layer, and avoids too sparse representation.…”
Section: Understanding Prior Work With Our Results
confidence: 99%
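The quoted design choice can be sketched in a few lines. Below is my own minimal illustration (not the authors' code) of a residual block whose shortcut is a 1×1 convolution projection rather than an identity mapping, even though input and output share the same spatial resolution — the situation the citing work describes for DetNet's stages 4–6. A 1×1 convolution is just a per-pixel linear map over channels, so it can be written as a matrix product:

```python
import numpy as np

def conv1x1(x, w):
    # x: (C_in, H, W), w: (C_out, C_in).
    # A 1x1 convolution is a per-pixel linear map over channels.
    c_in, h, wd = x.shape
    return (w @ x.reshape(c_in, -1)).reshape(w.shape[0], h, wd)

def bottleneck_projection(x, w_body, w_proj):
    # Residual block with a 1x1 projection shortcut instead of an
    # identity mapping. The "body" here is a stand-in for the real
    # (dilated 3x3) bottleneck path; spatial size is unchanged.
    body = np.maximum(conv1x1(x, w_body), 0.0)
    shortcut = conv1x1(x, w_proj)  # projection, same H x W as input
    return np.maximum(body + shortcut, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 14, 14))
y = bottleneck_projection(x,
                          rng.standard_normal((256, 256)) * 0.01,
                          rng.standard_normal((256, 256)) * 0.01)
print(y.shape)  # (256, 14, 14): spatial resolution is preserved
```

The point of the quote is that this projection is used even where an identity shortcut would type-check, which (per the citing authors' analysis) keeps stages 4 and 5 away from the output layer.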
“…(iii) The strides of conv5 x are too coarse to localize objects. DetNet [36] and ScratchDet [91] also discuss this problem and change the strides for object detection. Unlike these works, our finding is that SGD (with other regularization methods) automatically limits the intrinsic dimensionalities of standard ResNet without changing the strides.…”
Section: Eigenspectrum Dynamics
confidence: 99%
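The stride issue the citing authors raise is easy to make concrete. A rough sketch (illustrative stage strides of my own choosing, following the DetNet paper's stated design of stopping downsampling after the stride-16 stage) of cumulative stride per backbone stage:

```python
# Cumulative stride per backbone stage. A standard ResNet halves the
# feature map at every stage, ending at stride 32 in conv5 -- too
# coarse to localize objects, per the quote. DetNet instead stops
# downsampling at stride 16, so its later stages keep that resolution.
def cumulative_strides(per_stage_strides):
    out, s = [], 1
    for st in per_stage_strides:
        s *= st
        out.append(s)
    return out

resnet_like = cumulative_strides([2, 2, 2, 2, 2])      # conv1..conv5
detnet_like = cumulative_strides([2, 2, 2, 2, 1, 1])   # extra stage, no stride
print(resnet_like)  # [2, 4, 8, 16, 32]
print(detnet_like)  # [2, 4, 8, 16, 16, 16]
```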
“…In our study, we claim that the detection of small and occluded objects depends not only on detail features but also on semantic features and the contextual information [17]. Deep features have better expression towards the main characteristics of objects and more accurate semantic description of the objects in the scenes [13,15]. MDFN can effectively learn the deep features and yield compelling results on popular benchmark datasets.…”
Section: Introduction
confidence: 88%
“…According to [28], equation (2) performs well relying on the strong assumption that each feature map being fed into the final layer has to be sufficiently sophisticated to be helpful for detection and accurate localization of the objects. This is based on the following assumptions: 1) These feature maps should be able to provide the fine details especially for those from the earlier layers; 2) the function that transforms feature maps should be extended to the layers that are deep enough so that the high-level abstract information of the objects can be built into the feature maps; and 3) the feature maps should contain appropriate contextual information such that the occluded objects, small objects, blurred or overlapping ones can be inferred exactly and localized robustly [28,33,13].…”
Section: Deep Feature Extraction and Analysis
confidence: 99%
“…Furthermore, YOLOv2 [26] employs a fully convolutional network that results in m × n grids (m, n are the width and height of the output feature) and uses predefined anchors to better predict the bounding boxes of the objects. In [16], Li et al propose a backbone network that improves accuracy by maintaining high resolution for feature maps and reduces computation complexity by decreasing the width of the upper layers.…”
Section: Face Detection
confidence: 99%
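The trade-off described in the last quote (higher-resolution feature maps paid for by narrower upper layers) follows from the cost of a convolution being proportional to both spatial area and the product of channel widths. A back-of-the-envelope sketch with illustrative numbers of my own (2048 channels for a ResNet-style conv5 at stride 32, 256 channels for a narrower high-resolution stage at stride 16 — the latter figure matching the reduced width the DetNet paper describes):

```python
# Rough multiply-accumulate count for one 3x3 convolution layer:
# MACs = C_in * C_out * k * k * H * W.
def conv3x3_macs(c_in, c_out, h, w):
    return c_in * c_out * 3 * 3 * h * w

img = 224
# Stride 32 vs. stride 16: keeping resolution quadruples H*W, so the
# channel width of the upper stages is cut to keep computation in check.
wide_low_res   = conv3x3_macs(2048, 2048, img // 32, img // 32)
narrow_high_res = conv3x3_macs(256, 256, img // 16, img // 16)
print(narrow_high_res / wide_low_res)  # 0.0625: 4x the area, 1/64 the channel product
```

With these numbers the narrow high-resolution layer is 16× cheaper despite covering four times the spatial area, which is the sense in which decreasing upper-layer width offsets the cost of maintaining resolution.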