Orthographic Feature Transform for Monocular 3D Object Detection

Roddick, Thomas; Kendall, Alex; Cipolla, Roberto

doi:10.48550/arxiv.1811.08188

Cited by 54 publications

(68 citation statements)

References 35 publications


“…Similarly, Roddick et al (2018) transform monocular representation to BEV perspective. They introduce an orthographic feature transformation network that maps the features from the RGB perspective to a 3D voxel map.…”

Section: Informed Monocular Approaches (mentioning)

confidence: 99%

Survey and Systematization of 3D Object Detection Models and Methods

Moritz, Friederich, Egger et al. 2022

Preprint


This paper offers a comprehensive survey of recent developments in 3D object detection covering the full pipeline from input data, over data representation and feature extraction to the actual detection modules. We include basic concepts, focus our survey on a broad spectrum of different approaches arising in the last ten years and propose a systematization which offers a practical framework to compare those approaches on the methods level.


“…An early method for 3D detection from RGB images is Mono3D [21], which uses semantic and shape cues to select from a collection of 3D proposals, using scene constraints and additional priors at training time. [22] uses the birds-eye-view (BEV) for monocular 3D detection, and [23] leverages 2D detections for 3D bounding box regression via the minimization of 2D-3D projection error. The use of 2D detectors as a starting point for 3D computation recently has become a standard approach [24,25].…”

Section: Related Work (mentioning)

confidence: 99%

DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries

Wang, Guizilini, Zhang et al. 2021

Preprint


We introduce a framework for multi-camera 3D object detection. In contrast to existing works, which estimate 3D bounding boxes directly from monocular images or use depth prediction networks to generate input for 3D object detection from 2D information, our method manipulates predictions directly in 3D space. Our architecture extracts 2D features from multiple camera images and then uses a sparse set of 3D object queries to index into these 2D features, linking 3D positions to multi-view images using camera transformation matrices. Finally, our model makes a bounding box prediction per object query, using a set-to-set loss to measure the discrepancy between the ground-truth and the prediction. This top-down approach outperforms its bottom-up counterpart in which object bounding box prediction follows per-pixel depth estimation, since it does not suffer from the compounding error introduced by a depth prediction model. Moreover, our method does not require post-processing such as non-maximum suppression, dramatically improving inference speed. We achieve state-of-the-art performance on the nuScenes autonomous driving benchmark.


“…A common approach [2,11,13,14,20,29] is to use inverse perspective mapping (IPM) to map front-view image onto the ground plane via homography projection. OFT-Net [25] projects a fixed volume of voxels onto multiview images to collect features and complete 3D detection on the bird's-eye-view feature representation. "Lift, Split, Shoot" idea [24] is proposed to infer birds'-eye-view representation by lifting each image into a frustum of features and collapsing all frustum into a rasterized bird's-eye-view grid.…”

Section: Related Work (mentioning)

confidence: 99%

Voxelized 3D Feature Aggregation for Multiview Detection

Ma, Tong, Wang et al. 2021

Preprint


Multi-view detection incorporates multiple camera views to alleviate occlusion in crowded scenes, where the state-of-the-art approaches adopt homography transformations to project multi-view features to the ground plane. However, we find that these 2D transformations do not take into account the object's height, and with this neglection features along the vertical direction of same object are likely not projected onto the same ground plane point, leading to impure ground-plane features. To solve this problem, we propose VFA, voxelized 3D feature aggregation, for feature transformation and aggregation in multi-view detection. Specifically, we voxelize the 3D space, project the voxels onto each camera view, and associate 2D features with these projected voxels. This allows us to identify and then aggregate 2D features along the same vertical line, alleviating projection distortions to a large extent. Additionally, because different kinds of objects (human vs. cattle) have different shapes on the ground plane, we introduce the oriented Gaussian encoding to match such shapes, leading to increased accuracy and efficiency. We perform experiments on multiview 2D detection and multiview 3D detection problems. Results on four datasets (including a newly introduced MultiviewC dataset) show that our system is very competitive compared with the state-of-the-art approaches. Code and MultiviewC are released at https://github.com/Robert-Mar/VFA.

