2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
DOI: 10.1109/wacv51458.2022.00133
ImVoxelNet: Image to Voxels Projection for Monocular and Multi-View General-Purpose 3D Object Detection

Cited by 133 publications (59 citation statements)
References 33 publications
“…Methods transform image features into BEV features with the depth information from depth estimation [46] or categorical depth distribution [34]. OFT [36] and ImVoxelNet [37] project the predefined voxels onto image features to generate the voxel representation of the scene.…”
Section: Camera-based 3D Perception
confidence: 99%
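The statement above describes the projection-based lifting used by OFT and ImVoxelNet: predefined 3D voxels are projected onto the 2D image feature map and the corresponding features are sampled into a voxel representation of the scene. The sketch below illustrates that general idea under an assumed pinhole camera model; the function name, argument shapes, and sampling details are illustrative assumptions, not code from either paper.

```python
# Minimal sketch of image-to-voxel feature lifting (OFT / ImVoxelNet style).
# All names and shapes are assumptions for illustration only.
import torch
import torch.nn.functional as F

def lift_image_features_to_voxels(feats, K, T_cam_from_world, voxel_centers):
    """
    feats:            (C, Hf, Wf) image feature map (at feature-map resolution).
    K:                (3, 3) camera intrinsics scaled to the feature map.
    T_cam_from_world: (4, 4) world-to-camera extrinsics.
    voxel_centers:    (Nv, 3) predefined voxel centers in world coordinates.
    Returns (Nv, C) per-voxel features; voxels behind the camera get zeros.
    """
    C, Hf, Wf = feats.shape

    # World -> camera coordinates (homogeneous transform).
    ones = torch.ones_like(voxel_centers[:, :1])
    pts_cam = (T_cam_from_world @ torch.cat([voxel_centers, ones], dim=1).T).T[:, :3]
    z = pts_cam[:, 2:3].clamp(min=1e-5)

    # Pinhole projection of voxel centers onto the feature map.
    uv = (K @ pts_cam.T).T[:, :2] / z                      # (Nv, 2) pixel coords

    # Normalize to [-1, 1] and sample the feature map bilinearly.
    grid = torch.stack([uv[:, 0] / (Wf - 1), uv[:, 1] / (Hf - 1)], dim=-1) * 2 - 1
    sampled = F.grid_sample(feats[None], grid[None, None],  # output (1, C, 1, Nv)
                            mode="bilinear", padding_mode="zeros", align_corners=True)
    sampled = sampled[0, :, 0].T                            # (Nv, C)

    # Mask out voxels that project from behind the camera.
    valid = (pts_cam[:, 2] > 0).float().unsqueeze(-1)
    return sampled * valid
```

In a multi-view setting the same voxel grid can be lifted from each camera and the per-view features averaged, which is the usual way such projection-based methods aggregate information across images.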
“…(1) Using the global attention to replace deformable attention; (2) Making each query interact only with its reference points rather than the surrounding local regions, which is similar to previous methods [36,37]. For a broader comparison, we also replace the BEVFormer with the BEV generation methods proposed by VPN [30] and Lift-Splat [32].…”
Section: Ablation Study
confidence: 99%
“…However, these advancements in architecture types have not addressed the issue of learning viewpoint-agnostic representations. Viewpoint-agnostic representation learning is drawing increasing attention in the vision community due to its wide range of downstream applications such as 3D object detection [41], video alignment [6,16,17], action recognition [47,48], pose estimation [22,50], robot learning [24,26,43,45,49], and other tasks.…”
Section: Related Work
confidence: 99%
“…To overcome this challenge, [21] generates possible depth distributions for each pixel, which allows it to place the context in the nearest pillar rather than projecting along the entire ray. [26] projects the 2D image features into an aggregated 3D voxel representation using a pinhole camera model. The 3D representations are then processed by a standard CNN to predict BEV bounding boxes.…”
Section: Multiview Monocular Detection
confidence: 99%
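The excerpt above notes that the aggregated 3D voxel volume is subsequently processed by a standard CNN to predict BEV bounding boxes. A minimal sketch of that kind of BEV head is shown below, assuming the height axis is folded into channels before 2D convolutions; the module name, channel sizes, and box parameterization are illustrative assumptions, not the architecture from [26].

```python
# Minimal sketch of a BEV detection head over a lifted voxel volume.
# Channel sizes and the 7-DoF box parameterization (x, y, z, w, l, h, yaw)
# are assumptions for illustration only.
import torch.nn as nn

class SimpleBEVHead(nn.Module):
    def __init__(self, in_channels, height_bins, num_anchors=2, box_dim=7):
        super().__init__()
        bev_channels = in_channels * height_bins            # height folded into channels
        self.backbone = nn.Sequential(
            nn.Conv2d(bev_channels, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(inplace=True),
        )
        # Per-location classification scores and box regression targets.
        self.cls_head = nn.Conv2d(128, num_anchors, 1)
        self.reg_head = nn.Conv2d(128, num_anchors * box_dim, 1)

    def forward(self, voxel_feats):
        # voxel_feats: (B, C, Z, Y, X) volume of lifted image features.
        B, C, Z, Y, X = voxel_feats.shape
        bev = voxel_feats.reshape(B, C * Z, Y, X)           # collapse height into channels
        bev = self.backbone(bev)
        return self.cls_head(bev), self.reg_head(bev)
```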
“…As a product of these public datasets, we have seen an explosion of DNN model development in the literature (discussed in Section 2), where we are seeing the bar being raised year over year on these tasks. However, we often see that these cutting-edge models usually operate on only one modality and one task, as in [21,26], or that they perform multi-modal sensor fusion in a way that the sensor inputs are entangled and non-modular [3,18,22,27,30].…”
Section: Introduction
confidence: 99%