2019
DOI: 10.48550/arxiv.1908.09492
Preprint

Class-balanced Grouping and Sampling for Point Cloud 3D Object Detection

Abstract: This report presents our method, which won the nuScenes 3D Detection Challenge [17] held at the Workshop on Autonomous Driving (WAD, CVPR 2019). Generally, we utilize sparse 3D convolution to extract rich semantic features, which are then fed into a class-balanced multi-head network to perform 3D object detection. To handle the severe class imbalance problem inherent in autonomous driving scenarios, we design a class-balanced sampling and augmentation strategy to generate a more balanced data distribution. Furt…
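The class-balanced sampling idea lends itself to a short illustration. Below is a minimal Python sketch of frame-level duplication loosely modeled on the paper's dataset-sampling strategy; the function name, the equal-share weighting, and the stochastic rounding are illustrative choices, not the authors' exact procedure. Frames containing rare classes are duplicated so that each class contributes a roughly equal share of the resampled training set.

```python
import random
from collections import Counter

def class_balanced_duplication(frames, num_classes):
    """Resample a dataset so each class contributes a roughly equal share.

    frames: list of (frame_id, class_id_set) pairs; every frame is assumed
    to contain at least one annotated class. Returns a list of frame ids,
    with frames holding rare classes duplicated and frames holding only
    common classes kept (or dropped) proportionally.
    """
    total = len(frames)
    # Fraction of frames in which each class appears.
    freq = Counter(c for _, classes in frames for c in classes)
    target = 1.0 / num_classes  # desired per-class share
    resampled = []
    for frame_id, classes in frames:
        # A frame is duplicated according to its rarest class.
        ratio = max(target / (freq[c] / total) for c in classes)
        # Stochastic rounding: e.g. ratio 2.3 -> 2 copies, plus a 30% chance of a third.
        copies = int(ratio) + (random.random() < ratio - int(ratio))
        resampled.extend([frame_id] * copies)
    random.shuffle(resampled)
    return resampled

# Example: class 2 appears in only one of four frames, so frame "f3" is
# duplicated more often than the frames dominated by the common class 0.
frames = [("f0", {0}), ("f1", {0}), ("f2", {0, 1}), ("f3", {1, 2})]
print(class_balanced_duplication(frames, num_classes=3))
```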

Cited by 101 publications (167 citation statements) · References 23 publications
“…Training Parameters: Models are trained with the AdamW [25] optimizer with gradient clipping and a learning rate of 2e-4, at a total batch size of 64 on 8 NVIDIA GPUs; following [48], all models are trained with CBGS [54]. At test time, the input image is scaled by a factor of 0.48 and cropped to 704×256 resolution with a region of (x1, x2, y1, y2) = (32, 736, 176, 432).…”
Section: Experimental Settings (mentioning)
confidence: 99%
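The quoted test-time preprocessing is concrete enough to sketch. Assuming a 1600×900 nuScenes frame and PIL (the function name and resampling filter are my choices, not from the cited paper), scaling by 0.48 and applying the (x1, x2, y1, y2) crop yields the stated 704×256 input:

```python
from PIL import Image

def scale_and_crop(img, scale=0.48, crop=(32, 736, 176, 432)):
    """Scale an image, then crop the (x1, x2, y1, y2) region.

    For a 1600x900 frame, scaling by 0.48 gives 768x432; the crop
    (32, 736, 176, 432) then yields a 704x256 input.
    """
    w, h = img.size
    img = img.resize((int(w * scale), int(h * scale)), Image.BILINEAR)
    x1, x2, y1, y2 = crop
    # PIL's crop takes (left, upper, right, lower).
    return img.crop((x1, y1, x2, y2))

frame = Image.new("RGB", (1600, 900))
assert scale_and_crop(frame).size == (704, 256)
```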
“…We evaluated Fixy on two AV perception datasets: an internal dataset from our research organization and the publicly available Lyft Level 5 perception dataset [13]. The Lyft dataset has been used to develop models [33] and host competitions [27]. Both datasets consist of many scenes of LIDAR and camera data that were densely labeled with 3D bounding boxes by leading external vendors ("human-proposed labels").…”
Section: Methods (mentioning)
confidence: 99%
“…Observation sources. We used three sources of observations over the data: human-proposed labels, LIDAR ML model predictions [16,33], and expert auditor labels. All sources predict 3D bounding boxes.…”
Section: Methods (mentioning)
confidence: 99%
“…For 3D detection, we use the same VoxelNet [75] and PointPillars [23] architectures following [23,66,76]. For VoxelNet, the detection range is [−54m, 54m] for the X and Y axes and [−5m, 3m] for the Z axis, while the range is [−51.2m, 51.2m] for the X and Y axes for the PointPillars architecture.…”
Section: Methods (mentioning)
confidence: 99%
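The detection ranges quoted here amount to an axis-aligned crop of the input point cloud. Below is a minimal sketch, assuming numpy and (N, 3+) point arrays; the PointPillars Z range is not given in the excerpt and is assumed here to match VoxelNet's.

```python
import numpy as np

# (x_min, y_min, z_min, x_max, y_max, z_max), in meters.
VOXELNET_RANGE = (-54.0, -54.0, -5.0, 54.0, 54.0, 3.0)
POINTPILLARS_RANGE = (-51.2, -51.2, -5.0, 51.2, 51.2, 3.0)  # Z range assumed

def crop_to_range(points, pc_range):
    """Keep the rows of an (N, 3+) point array that lie inside the range."""
    x_min, y_min, z_min, x_max, y_max, z_max = pc_range
    xyz = points[:, :3]
    mask = ((xyz >= (x_min, y_min, z_min)) &
            (xyz <= (x_max, y_max, z_max))).all(axis=1)
    return points[mask]

# Example: random x, y, z, intensity points cropped to the VoxelNet range.
pts = np.random.uniform(-60.0, 60.0, size=(1000, 4))
print(crop_to_range(pts, VOXELNET_RANGE).shape)
```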