2018
DOI: 10.48550/arxiv.1811.08188
Preprint

Orthographic Feature Transform for Monocular 3D Object Detection

Abstract: 3D object detection from monocular images has proven to be an enormously challenging task, with the performance of leading systems not yet achieving even 10% of that of LiDAR-based counterparts. One explanation for this performance gap is that existing systems are entirely at the mercy of the perspective image-based representation, in which the appearance and scale of objects varies drastically with depth and meaningful distances are difficult to infer. In this work we argue that the ability to reason about th…
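
The core technique the abstract describes, mapping perspective image features into an orthographic bird's-eye-view (BEV) space, can be sketched roughly as below. This is a minimal illustration, not the paper's implementation: it assumes a pinhole camera with intrinsics K, a feature map aligned with full-resolution pixel coordinates, and nearest-pixel sampling at voxel centres, whereas the paper accumulates features over each voxel's full projected extent using integral images. All function and variable names are illustrative.

```python
# Hedged sketch of an orthographic-feature-transform-style pooling step.
# Assumptions (not from the paper): pinhole intrinsics K, features at full
# image resolution, a single ground-plane slice of voxels, nearest-pixel
# sampling instead of integral-image accumulation.
import numpy as np

def project_points(points, K):
    """Project Nx3 camera-frame points to Nx2 pixel coordinates."""
    uvw = points @ K.T                      # (N, 3) homogeneous pixels
    return uvw[:, :2] / uvw[:, 2:3]

def orthographic_feature_transform(feat, K, x_range, z_range, y_height, cell=0.5):
    """Pool image features into a ground-plane BEV grid of (Z, X, C).

    Each BEV cell centre (x, y_height, z) is projected into the image and
    the nearest feature vector is copied into that cell.
    """
    H, W, C = feat.shape
    xs = np.arange(x_range[0], x_range[1], cell)
    zs = np.arange(z_range[0], z_range[1], cell)
    X, Z = np.meshgrid(xs, zs)                               # (Zn, Xn)
    centres = np.stack(
        [X, np.full_like(X, y_height), Z], axis=-1).reshape(-1, 3)
    uv = project_points(centres, K)
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, W - 1)   # clamp off-image
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, H - 1)
    return feat[v, u].reshape(len(zs), len(xs), C)

# Toy usage with random features and KITTI-like intrinsics (illustrative).
K = np.array([[721.5, 0.0, 609.6],
              [0.0, 721.5, 172.9],
              [0.0, 0.0, 1.0]])
feat = np.random.rand(375, 1242, 8)
bev = orthographic_feature_transform(feat, K, (-10, 10), (2, 40), y_height=1.65)
print(bev.shape)   # (76, 40, 8)
```

Collapsing the sampled volume along the height axis (here just a single ground-plane slice) yields the BEV map on which detection heads can reason with consistent scale and meaningful distances.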

Cited by 54 publications (68 citation statements)
References 35 publications
“…Similarly, Roddick et al. (2018) transform the monocular representation to the BEV perspective. They introduce an orthographic feature transformation network that maps the features from the RGB perspective to a 3D voxel map.…”
Section: Informed Monocular Approaches
confidence: 99%
“…An early method for 3D detection from RGB images is Mono3D [21], which uses semantic and shape cues to select from a collection of 3D proposals, using scene constraints and additional priors at training time. [22] uses the bird's-eye-view (BEV) for monocular 3D detection, and [23] leverages 2D detections for 3D bounding box regression via the minimization of 2D-3D projection error. The use of 2D detectors as a starting point for 3D computation has recently become a standard approach [24,25].…”
Section: Related Work
confidence: 99%
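
The 2D-3D projection-error idea attributed to [23] in the quote above can be illustrated with a small scoring function: given regressed dimensions and yaw, a candidate translation is evaluated by how tightly the projected 3D box corners fit the detected 2D box. This is a hedged sketch of the general idea under assumed conventions (camera frame with y down, yaw about the y-axis), not the cited paper's exact closed-form solution.

```python
# Hedged sketch of a 2D-3D reprojection-error term. Conventions (y-down
# camera frame, yaw about y) and all names are illustrative assumptions.
import numpy as np

def box_corners(dims, yaw, t):
    """8 corners of a 3D box with dims (l, w, h), heading yaw, centre t."""
    l, w, h = dims
    x = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * l / 2
    y = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * h / 2
    z = np.array([1, -1, -1, 1, 1, -1, -1, 1]) * w / 2
    R = np.array([[np.cos(yaw), 0, np.sin(yaw)],   # rotation about y-axis
                  [0, 1, 0],
                  [-np.sin(yaw), 0, np.cos(yaw)]])
    return (R @ np.stack([x, y, z])).T + t

def projection_error(dims, yaw, t, K, box2d):
    """L1 gap between the projected 3D box's pixel bounds and the 2D box
    given as (u_min, v_min, u_max, v_max)."""
    uvw = box_corners(dims, yaw, t) @ K.T
    uv = uvw[:, :2] / uvw[:, 2:3]
    proj = np.array([uv[:, 0].min(), uv[:, 1].min(),
                     uv[:, 0].max(), uv[:, 1].max()])
    return np.abs(proj - np.asarray(box2d)).sum()

# Toy usage: score one candidate translation against a detected 2D box.
K = np.array([[721.5, 0.0, 609.6],
              [0.0, 721.5, 172.9],
              [0.0, 0.0, 1.0]])
err = projection_error((4.0, 1.6, 1.5), 0.1,
                       np.array([1.0, 1.5, 20.0]), K,
                       (540, 160, 700, 230))
print(err)
```

In practice the translation minimizing this error (or a constrained linear system over the box-edge correspondences) pins the 3D box to the 2D detection.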
“…A common approach [2,11,13,14,20,29] is to use inverse perspective mapping (IPM) to map the front-view image onto the ground plane via homography projection. OFT-Net [25] projects a fixed volume of voxels onto multi-view images to collect features and performs 3D detection on the bird's-eye-view feature representation. The "Lift, Splat, Shoot" idea [24] infers a bird's-eye-view representation by lifting each image into a frustum of features and collapsing all frustums into a rasterized bird's-eye-view grid.…”
Section: Related Work
confidence: 99%
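
The IPM baseline mentioned in the quote above reduces to a single homography once the ground is assumed planar. Below is a hedged numpy-only sketch under illustrative assumptions (camera frame with y pointing down, ground plane y = h, identity extrinsics, nearest-neighbour sampling); real pipelines would use calibrated extrinsics and proper interpolation.

```python
# Hedged sketch of inverse perspective mapping (IPM) onto a flat ground
# plane y = h in the camera frame. Grid ranges, h, and names are
# illustrative assumptions, not from any cited paper.
import numpy as np

def ground_to_image_homography(K, h):
    """Homography mapping (x, z, 1) on the plane y = h to pixel coords."""
    M = np.array([[1.0, 0.0, 0.0],    # x stays x
                  [0.0, 0.0, h],      # y is fixed at the plane height h
                  [0.0, 1.0, 0.0]])   # z comes from the second BEV coord
    return K @ M

def ipm(image, K, h, x_range, z_range, cell=0.1):
    """Resample an (H, W, C) image onto a BEV ground-plane grid by inverse
    mapping: for every BEV cell, look up its source pixel."""
    H_img, W_img = image.shape[:2]
    Hg = ground_to_image_homography(K, h)
    xs = np.arange(x_range[0], x_range[1], cell)
    zs = np.arange(z_range[1], z_range[0], -cell)     # far rows first
    X, Z = np.meshgrid(xs, zs)
    pts = np.stack([X, Z, np.ones_like(X)], axis=-1) @ Hg.T
    u = pts[..., 0] / pts[..., 2]
    v = pts[..., 1] / pts[..., 2]
    ui = np.clip(np.round(u).astype(int), 0, W_img - 1)
    vi = np.clip(np.round(v).astype(int), 0, H_img - 1)
    return image[vi, ui]

# Toy usage with a random image and KITTI-like intrinsics (illustrative).
K = np.array([[721.5, 0.0, 609.6],
              [0.0, 721.5, 172.9],
              [0.0, 0.0, 1.0]])
img = np.random.rand(375, 1242, 3)
bev = ipm(img, K, h=1.65, x_range=(-10, 10), z_range=(5, 50))
print(bev.shape)   # (450, 200, 3)
```

Unlike the OFT-style voxel pooling above, IPM is only exact for points that actually lie on the ground plane, which is why upright objects smear along rays in the warped view.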