2022
DOI: 10.1007/978-3-031-20077-9_1
BEVFormer: Learning Bird’s-Eye-View Representation from Multi-camera Images via Spatiotemporal Transformers

Cited by 553 publications (179 citation statements)
References 41 publications
“…The significant improvement on mAP suggests MV2D has a strong capability at localizing objects in 3D space. In comparison with multi-view 3D object detection methods, MV2D with ResNet-101 outperforms BEV based methods BEVFormer-S [20] by 3.1% and 0.3% on mAP and NDS, and outperforms BEVDepth [19] with extra depth supervision by 3.0% and 4.3% on mAP and NDS. In comparison with query based methods, MV2D outperforms the best performing PETR [24] by 3.6% and 0.9% on mAP and NDS.…”
Section: Comparison With State-of-the-arts
confidence: 96%
“…ImVoxelNet [32] builds a 3D voxelized space and samples image features from multi-view to obtain the voxel representation. BEV-Former [20] leverages dense BEV queries to project and aggregate features from multi-view images by deformable attention [44]. BEVDet and BEVDepth [14,19] adopt the Lift-Splat module [29] to transform multi-view image features into the BEV representation based on the predicted depth distribution.…”
Section: Vision-Based 3D Object Detection
confidence: 99%
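The statement above describes BEVFormer's core mechanism: dense BEV queries that project into the multi-view image planes and aggregate features via deformable attention. The sketch below illustrates that sampling pattern in miniature; the function name `deformable_sample` is hypothetical, and nearest-neighbor lookup stands in for the bilinear interpolation (and multi-head, multi-level structure) a real deformable-attention layer would use.

```python
import numpy as np

def deformable_sample(feature_map, ref_points, offsets, weights):
    """Toy single-head deformable-attention sampling.

    Each BEV query has a reference pixel in the image feature map; it
    samples K offset locations around it and takes a weighted sum.
    (Real implementations use bilinear interpolation and learned
    offsets/weights; here everything is passed in explicitly.)

    feature_map: (H, W, C) image features
    ref_points:  (Q, 2)   reference (y, x) pixel per query
    offsets:     (Q, K, 2) sampling offsets per query
    weights:     (Q, K)   attention weights per sample (rows sum to 1)
    Returns:     (Q, C)   aggregated feature per query
    """
    H, W, C = feature_map.shape
    Q, K, _ = offsets.shape
    out = np.zeros((Q, C))
    for q in range(Q):
        for k in range(K):
            y, x = ref_points[q] + offsets[q, k]
            # Nearest-neighbor lookup, clamped to the feature map bounds.
            yi = int(np.clip(round(float(y)), 0, H - 1))
            xi = int(np.clip(round(float(x)), 0, W - 1))
            out[q] += weights[q, k] * feature_map[yi, xi]
    return out
```

Because each query attends to only K sampled points rather than all H×W locations, the cost scales with the number of queries, which is what makes dense BEV grids tractable.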
“…M2BEV [64] also investigates the viability of simultaneously running multi-tasks perception based on BEV features. BEVFormer [65] proposes a spatiotemporal transformer that aggregates BEV features from current and previous features via deformable attention [66]. Compared to object detection, semantic scene completion can provide occupancy for each small cell instead of assigning a fixed-size bounding box to an object.…”
Section: Related Work
confidence: 99%
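The spatiotemporal aggregation mentioned above relies on aligning the previous frame's BEV features to the current ego pose before fusing them. The sketch below shows that idea at its simplest: an integer grid shift followed by a blend. The function name `fuse_temporal_bev` and the averaging step are illustrative assumptions; BEVFormer itself performs the alignment inside a deformable temporal self-attention layer rather than by explicit warping.

```python
import numpy as np

def fuse_temporal_bev(bev_prev, bev_curr, ego_shift, alpha=0.5):
    """Align the previous BEV grid to the current ego pose, then blend.

    bev_prev, bev_curr: (H, W, C) BEV feature grids
    ego_shift: (dy, dx) ego motion since the last frame, in grid cells
    alpha: weight on the aligned previous features

    A feature at row y in the previous frame corresponds to row y - dy
    in the current frame after the ego moves forward by dy cells, so we
    roll the grid by (-dy, -dx). Note np.roll wraps around at the
    borders; a real system would mask the cells that leave the grid.
    """
    dy, dx = ego_shift
    aligned = np.roll(bev_prev, shift=(-dy, -dx), axis=(0, 1))
    return alpha * aligned + (1 - alpha) * bev_curr
```

Recurrently fusing one aligned history frame like this gives the temporal context that helps with occlusions and velocity estimation, without storing a long feature queue.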