2022
DOI: 10.48550/arxiv.2203.17270
Preprint

BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers

Abstract: 3D visual perception tasks, including 3D detection and map segmentation based on multi-camera images, are essential for autonomous driving systems. In this work, we present a new framework termed BEVFormer, which learns unified BEV representations with spatiotemporal transformers to support multiple autonomous driving perception tasks. In a nutshell, BEVFormer exploits both spatial and temporal information by interacting with spatial and temporal space through predefined grid-shaped BEV queries. To aggregate s…
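To make the abstract's notion of predefined grid-shaped BEV queries concrete, below is a minimal sketch (not the official BEVFormer code) of how per-cell learnable queries could cross-attend to flattened multi-camera features. The grid size, feature dimension, and the single standard attention layer are illustrative assumptions; the actual model stacks deformable spatial cross-attention and temporal self-attention over several encoder layers, with a larger (e.g. 200×200) BEV grid.

```python
import torch
import torch.nn as nn

class ToyBEVQueryEncoder(nn.Module):
    """Sketch of grid-shaped BEV queries pooling multi-camera image features."""

    def __init__(self, bev_h=25, bev_w=25, embed_dim=256, num_heads=8):
        super().__init__()
        # One learnable query per BEV grid cell ("predefined grid-shaped BEV queries").
        self.bev_queries = nn.Parameter(torch.randn(bev_h * bev_w, embed_dim))
        self.cross_attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)

    def forward(self, cam_feats):
        # cam_feats: (B, N_cams * H * W, C) flattened multi-camera backbone features.
        b = cam_feats.size(0)
        queries = self.bev_queries.unsqueeze(0).expand(b, -1, -1)
        # Each BEV query attends to the camera features to build the BEV representation.
        bev, _ = self.cross_attn(queries, cam_feats, cam_feats)
        return bev  # (B, bev_h * bev_w, C), reshapeable to a BEV feature map

# Usage with random tensors standing in for backbone outputs (6 cameras, 10x20 maps).
feats = torch.randn(2, 6 * 10 * 20, 256)
bev = ToyBEVQueryEncoder()(feats)
print(bev.shape)  # torch.Size([2, 625, 256])
```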

Cited by 28 publications (117 citation statements)
References 46 publications

“…Lift-Splat [23] learns categorical depth distributions in an unsupervised manner to generate bird's-eye-view representations. Following the transformer trend, BEVFormer [13] learns the bird's-eye-view representation from multi-camera images via spatiotemporal transformers to capture information across cameras. Bird's-eye-view methods that rely on depth information are sensitive to the accuracy of that depth information.…”
Section: Intermediate Representation Methods (mentioning)
confidence: 99%
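The categorical depth distributions mentioned in the statement above can be illustrated with a short sketch, under assumed tensor shapes and with Lift-Splat's geometric voxel pooling simplified to a collapse of the image height axis; this is not the reference implementation.

```python
import torch

B, C, H, W, D = 1, 64, 16, 44, 41          # batch, channels, feature height/width, depth bins
feats = torch.randn(B, C, H, W)            # camera feature map from an image backbone
depth_logits = torch.randn(B, D, H, W)     # predicted per-pixel depth logits
depth_prob = depth_logits.softmax(dim=1)   # categorical depth distribution per pixel

# Outer product: each pixel's feature is spread across depth bins, weighted by its
# depth probability, producing a (B, C, D, H, W) frustum of pseudo-3D features.
frustum = depth_prob.unsqueeze(1) * feats.unsqueeze(2)

# Collapsing the image height axis stands in for splatting onto the ground plane,
# giving a coarse bird's-eye-view map over (depth, lateral) cells.
bev = frustum.sum(dim=3)                   # (B, C, D, W)
print(bev.shape)
```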
“…As objects in common autonomous driving datasets generally move on flat ground, PointPillars [19] proposes to map the 3D features onto a bird's-eye-view 2D space to reduce the computational overhead. This soon became a de facto standard in the domain [37,19,15,52,31,22]. Lift-Splat-Shoot (LSS) [32] uses a depth estimation network to extract the implicit depth information of multi-perspective images and transform camera feature maps into the 3D ego-car coordinate frame.…”
Section: Related Work (mentioning)
confidence: 99%
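The mapping of 3D (pillar) features onto a bird's-eye-view 2D canvas described in the statement above could look roughly like the following; the grid size, pillar count, and random coordinates are placeholders rather than PointPillars' actual configuration.

```python
import torch

C, H, W = 64, 200, 200                        # feature channels and BEV grid size (assumed)
num_pillars = 1000
pillar_feats = torch.randn(num_pillars, C)    # one feature vector per non-empty pillar
coords = torch.randint(0, 200, (num_pillars, 2))  # (x, y) grid cell of each pillar

# Scatter pillar features into a dense BEV canvas so 2D convolutions can run on it.
bev = torch.zeros(C, H, W)
bev[:, coords[:, 1], coords[:, 0]] = pillar_feats.t()
print(bev.shape)  # torch.Size([64, 200, 200])
```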
“…In the early stages of perception systems, people designed separate deep models for each sensor [35,36,56,15,51] and fused the information via post-processing approaches [30]. Notably, bird's eye view (BEV) has become a de facto standard for autonomous driving scenarios since, generally speaking, cars cannot fly [19,22,37,15,52,31]. However, it is often difficult to regress 3D bounding boxes from pure image inputs due to the lack of depth information, and it is similarly difficult to classify objects from point clouds when the LiDAR does not receive enough points.…”
Section: Introduction (mentioning)
confidence: 99%
“…3D detection has received extensive attention as one of the fundamental tasks in autonomous driving scenarios [41,30,42,29,11,14,27,7,39,22]. Recently, fusing the two common modalities, camera and LiDAR input, has become a de facto standard in the 3D detection domain, as each modality carries information complementary to the other [4,33,34,31,40,8,44].…”
Section: Introduction (mentioning)
confidence: 99%