ImVoxelNet: Image to Voxels Projection for Monocular and Multi-View General-Purpose 3D Object Detection

Rukhovich, Danila; Vorontsova, Anna; Konushin, Anton

doi:10.1109/wacv51458.2022.00133

Cited by 133 publications

(59 citation statements)

References 33 publications

“…Methods transform image features into BEV features with the depth information from depth estimation [46] or categorical depth distribution [34]. OFT [36] and ImVoxelNet [37] project the predefined voxels onto image features to generate the voxel representation of the scene.…”

Section: Camera-based 3D Perception (mentioning)

confidence: 99%

“…(1) Using the global attention to replace deformable attention; (2) Making each query only interact with its reference points rather than the surrounding local regions, and it is similar to previous methods [36,37]. For a broader comparison, we also replace the BEVFormer with the BEV generation methods proposed by VPN [30] and Lift-Splat [32].…”

Section: Ablation Study (mentioning)

confidence: 99%

BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

Li, Wang, Li et al. 2022

Preprint

3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate spatial information, we design a spatial cross-attention that each BEV query extracts the spatial features from the regions of interest across camera views. For temporal information, we propose a temporal self-attention to recurrently fuse the history BEV information. Our approach achieves the new state-of-the-art 56.9% in terms of NDS metric on the nuScenes test set, which is 9.0 points higher than previous best arts and on par with the performance of LiDAR-based baselines. We further show that BEVFormer remarkably improves the accuracy of velocity estimation and recall of objects under low visibility conditions. The code will be released at https://github.com/zhiqi-li/BEVFormer.

“…However, these advancements in architecture-types have not addressed the issue of learning viewpoint-agnostic representation. Viewpoint-agnostic representation learning is drawing increasing attention in the vision community due to its wide range of downstream applications like 3D object detection [41], video alignment [6,16,17], action recognition [47,48], pose estimation [22,50], robot learning [24,26,43,45,49], and other tasks.…”

Section: Related Work (mentioning)

confidence: 99%

Learning Viewpoint-Agnostic Visual Representations by Recovering Tokens in 3D Space

Shang, Das, Ryoo 2022

Preprint

Humans are remarkably flexible in understanding viewpoint changes due to visual cortex supporting the perception of 3D structure. In contrast, most of the computer vision models that learn visual representation from a pool of 2D images often fail to generalize over novel camera viewpoints. Recently, the vision architectures have shifted towards convolution-free architectures, visual Transformers, which operate on tokens derived from image patches. However, neither these Transformers nor 2D convolutional networks perform explicit operations to learn viewpoint-agnostic representation for visual understanding. To this end, we propose a 3D Token Representation Layer (3DTRL) that estimates the 3D positional information of the visual tokens and leverages it for learning viewpoint-agnostic representations. The key elements of 3DTRL include a pseudo-depth estimator and a learned camera matrix to impose geometric transformations on the tokens. These enable 3DTRL to recover the 3D positional information of the tokens from 2D patches. In practice, 3DTRL is easily plugged-in into a Transformer. Our experiments demonstrate the effectiveness of 3DTRL in many vision tasks including image classification, multi-view video alignment, and action recognition. The models with 3DTRL outperform their backbone Transformers in all the tasks with minimal added computation. Our project page is at https://www3.cs.stonybrook.edu/~jishang/3dtrl/3dtrl.html.
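The abstract above describes recovering 3D token positions from 2D patches using a per-token depth estimate and a camera matrix. A minimal NumPy sketch of that general idea (standard pinhole back-projection) might look like the following. It is not the 3DTRL implementation: the function name `backproject_tokens`, the fixed intrinsics `K`, and the given camera-to-world pose are illustrative assumptions, whereas the paper estimates depth and camera parameters with learned modules.

```python
import numpy as np


def backproject_tokens(pixel_uv, depth, K, T_cam_to_world):
    """Lift 2D token centers to 3D points (hypothetical helper, not 3DTRL code).

    pixel_uv:        (N, 2) token centers in pixel coordinates (u, v)
    depth:           (N,)   per-token depth along the camera z-axis (assumed given)
    K:               (3, 3) pinhole intrinsics (assumed known here)
    T_cam_to_world:  (4, 4) camera-to-world rigid transform (assumed known here)
    Returns an (N, 3) array of world-space points.
    """
    n = pixel_uv.shape[0]
    # Homogeneous pixel coordinates [u, v, 1].
    uv1 = np.concatenate([pixel_uv, np.ones((n, 1))], axis=1)
    # Rays in the camera frame with unit z-component.
    rays = (np.linalg.inv(K) @ uv1.T).T
    # Scale each ray by its depth to get camera-frame points.
    cam_pts = rays * depth[:, None]
    # Move the points into the world frame.
    homo = np.concatenate([cam_pts, np.ones((n, 1))], axis=1)
    return (T_cam_to_world @ homo.T).T[:, :3]


if __name__ == "__main__":
    K = np.array([[2.0, 0.0, 4.0], [0.0, 2.0, 4.0], [0.0, 0.0, 1.0]])
    T = np.eye(4)
    T[0, 3] = 1.0  # camera sits at x = 1 in the world frame
    uv = np.array([[4.0, 4.0], [6.0, 2.0]])
    d = np.array([3.0, 2.0])
    print(backproject_tokens(uv, d, K, T))
```

In a learned setting the depth and pose would come from network outputs rather than constants, so this sketch only illustrates the geometric step.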

“…To overcome this challenge, [21] generates possible depth distributions for each pixel, which makes it able to place the context in the nearest pillar, rather than projecting along the entire ray. [26] projects the 2D image features into an aggregated 3D voxel representation using a pinhole camera model. The 3D representations are then processed by a standard CNN to predict BEV bounding boxes.…”

Section: Multiview Monocular Detection (mentioning)

confidence: 99%
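The statement above describes projecting 2D image features into an aggregated 3D voxel representation with a pinhole camera model. A minimal NumPy sketch of that projection step follows. The function names, the nearest-pixel sampling, and the per-voxel averaging over views are assumptions made for illustration; this is not the ImVoxelNet code, which may sample and aggregate differently.

```python
import numpy as np


def project_voxels_to_image(voxel_centers, K, T_world_to_cam, feat_map):
    """Sample one feature vector per voxel center (hypothetical helper).

    voxel_centers:   (N, 3) voxel centers in world coordinates
    K:               (3, 3) intrinsics, scaled to the feature-map resolution
    T_world_to_cam:  (4, 4) world-to-camera transform
    feat_map:        (C, H, W) image feature map
    Returns (features of shape (N, C), boolean validity mask of shape (N,)).
    """
    n = voxel_centers.shape[0]
    c, h, w = feat_map.shape
    homo = np.concatenate([voxel_centers, np.ones((n, 1))], axis=1)
    cam = (T_world_to_cam @ homo.T).T[:, :3]
    z = cam[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)  # avoid dividing by zero or negatives
    pix = (K @ cam.T).T
    # Nearest-pixel lookup (an assumption; bilinear sampling is also common).
    col = np.round(pix[:, 0] / safe_z).astype(int)
    row = np.round(pix[:, 1] / safe_z).astype(int)
    valid = in_front & (col >= 0) & (col < w) & (row >= 0) & (row < h)
    feats = np.zeros((n, c), dtype=feat_map.dtype)
    feats[valid] = feat_map[:, row[valid], col[valid]].T
    return feats, valid


def aggregate_views(voxel_centers, cameras, feat_maps):
    """Average the sampled features over the views that see each voxel."""
    total, count = None, np.zeros(voxel_centers.shape[0])
    for (K, T), fm in zip(cameras, feat_maps):
        feats, valid = project_voxels_to_image(voxel_centers, K, T, fm)
        total = feats.astype(float) if total is None else total + feats
        count += valid
    # Voxels seen by no camera keep an all-zero feature vector.
    return total / np.maximum(count, 1.0)[:, None]
```

The resulting `(N, C)` array can be reshaped to the voxel grid and passed to a downstream network; voxels outside every camera frustum stay zero in this sketch.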

“…As a product of these public datasets, we have seen an explosion of DNN model development in literature (discussed in section 2), where we are seeing the bar being raised year over year on these tasks. However we often see that these cutting edge models usually only operate on one modality and one task such as in [21,26] or that they perform multi-modal sensor fusion in a way that the sensor inputs are entangled and non-modular [3,18,22,27,30].…”

Section: Introduction (mentioning)

confidence: 99%

Scalable Primitives for Generalized Sensor Fusion in Autonomous Vehicles

Sidhu, Wang, Naseer et al. 2021

Preprint

In autonomous driving, there has been an explosion in the use of deep neural networks for perception, prediction and planning tasks. As autonomous vehicles (AVs) move closer to production, multi-modal sensor inputs and heterogeneous vehicle fleets with different sets of sensor platforms are becoming increasingly common in the industry. However, neural network architectures typically target specific sensor platforms and are not robust to changes in input, making the problem of scaling and model deployment particularly difficult. Furthermore, most players still treat the problem of optimizing software and hardware as entirely independent problems. We propose a new end to end architecture, Generalized Sensor Fusion (GSF), which is designed in such a way that both sensor inputs and target tasks are modular and modifiable. This enables AV system designers to easily experiment with different sensor configurations and methods and opens up the ability to deploy on heterogeneous fleets using the same models that are shared across a large engineering organization. Using this system, we report experimental results where we demonstrate near-parity of an expensive high-density (HD) LiDAR sensor with a cheap low-density (LD) LiDAR plus camera setup in the 3D object detection task. This paves the way for the industry to jointly design hardware and software architectures as well as large fleets with heterogeneous configurations.
