Point-M2AE: Multi-scale Masked Autoencoders for Hierarchical Point Cloud Pre-training

Zhang, Renrui; Gao, Peng; Fang, Rongyao; Zhao, Lei; Wang, Dong; Qiao, Yu; Li, Hongsheng

doi:10.48550/arxiv.2205.14401

Cited by 12 publications

(12 citation statements)

References 42 publications


“…Masked Image Modeling. Inspired by BERT [11] for Masked Language Modeling, Masked Image Modeling (MIM) becomes a popular pretext task for visual representation learning [6,14,2,46,1,4,51,3,49]. MIM aims to reconstruct the masked tokens from a corrupted input.…”

Section: Related Work (mentioning)

confidence: 99%

Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D Understanding

Liu, Wang, Liu et al. 2023

Preprint


Multi-view camera-based 3D detection is a challenging problem in computer vision. Recent works leverage a pretrained LiDAR detection model to transfer knowledge to a camera-based student network. However, we argue that there is a major domain gap between the LiDAR BEV features and the camera-based BEV features, as they have different characteristics and are derived from different sources. In this paper, we propose Geometry Enhanced Masked Image Modeling (GeoMIM) to transfer the knowledge of the LiDAR model in a pretrain-finetune paradigm for improving the multi-view camera-based 3D detection. GeoMIM is a multi-camera vision transformer with Cross-View Attention (CVA) blocks that uses LiDAR BEV features encoded by the pretrained BEV model as learning targets. During pretraining, GeoMIM's decoder has a semantic branch completing dense perspective-view features and the other geometry branch reconstructing dense perspective-view depth maps. The depth branch is designed to be camera-aware by inputting the camera's parameters for better transfer capability. Extensive results demonstrate that GeoMIM outperforms existing methods on nuScenes benchmark, achieving state-of-the-art performance for camera-based 3D object detection and 3D segmentation.
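The citation statement for this entry summarizes masked image modeling as reconstructing masked tokens from a corrupted input. A minimal NumPy sketch of that general setup follows. It is not GeoMIM's implementation: the helper names `random_mask` and `masked_mse`, the 75% mask ratio, the zero-fill corruption, and the mean-squared-error objective are all assumptions chosen for illustration.

```python
import numpy as np


def random_mask(tokens, mask_ratio, rng):
    """Randomly split token indices into visible and masked subsets.

    Hypothetical helper: real MIM methods differ in how they choose masks.
    """
    n = tokens.shape[0]
    n_masked = int(round(n * mask_ratio))
    perm = rng.permutation(n)
    masked_idx = np.sort(perm[:n_masked])
    visible_idx = np.sort(perm[n_masked:])
    return visible_idx, masked_idx


def masked_mse(pred, target, masked_idx):
    """Mean squared error computed only over the masked positions."""
    diff = pred[masked_idx] - target[masked_idx]
    return float(np.mean(diff ** 2))


rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))            # 16 tokens, 8 features each (toy data)
visible_idx, masked_idx = random_mask(tokens, 0.75, rng)

# Corrupt the input by zeroing the masked tokens (one possible corruption).
corrupted = tokens.copy()
corrupted[masked_idx] = 0.0

# A trained model would predict the masked tokens from the visible ones;
# here the corrupted input stands in as a trivial "prediction".
loss = masked_mse(corrupted, tokens, masked_idx)
```

Restricting the loss to masked positions keeps the objective from being satisfied by simply copying the visible tokens.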


“…Annotating point clouds demands significant effort, necessitating self-supervised pre-training methods. Prior approaches primarily focus on object CAD models [21,26,29,39,42,44] and indoor scenes [17,35,46]. Point-BERT [42] applies BERT-like paradigms for point cloud recognition, while Point-MAE [26] reconstructs point patches without the tokenizer.…”

Section: Related Work (mentioning)

confidence: 99%

MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training

Xu, Wang, Zhang et al. 2023

Preprint


This paper introduces the Masked Voxel Jigsaw and Reconstruction (MV-JAR) method for LiDAR-based self-supervised pre-training and a carefully designed data-efficient 3D object detection benchmark on the Waymo dataset. Inspired by the scene-voxel-point hierarchy in downstream 3D object detectors, we design masking and reconstruction strategies accounting for voxel distributions in the scene and local point distributions within the voxel. We employ a Reversed-Furthest-Voxel-Sampling strategy to address the uneven distribution of LiDAR points and propose MV-JAR, which combines two techniques for modeling the aforementioned distributions, resulting in superior performance. Our experiments reveal limitations in previous data-efficient experiments, which uniformly sample fine-tuning splits with varying data proportions from each LiDAR sequence, leading to similar data diversity across splits. To address this, we propose a new benchmark that samples scene sequences for diverse fine-tuning splits, ensuring adequate model convergence and providing a more accurate evaluation of pre-training methods. Experiments on our Waymo benchmark and the KITTI dataset demonstrate that MV-JAR consistently and significantly improves 3D detection performance across various data scales, achieving up to a 6.3% increase in mAPH compared to training from scratch. Codes and the benchmark will be available at https://github.com/SmartBot-PJLab/MV-JAR.


“…Following masked autoencoder (MAE) [20], Point-MAE [37] reconstructs the coordinates of masked points. Point-M2AE [62] extends the MAE pipeline to hierarchical multi-scale networks. Mask-Point [28] models an implicit representation to avoid information leakage.…”

Section: Related Work (mentioning)

confidence: 99%

Applying Plain Transformers to Real-World Point Clouds

Li, Heizmann 2023

Preprint


Due to the lack of inductive bias, transformer-based models usually require a large amount of training data. The problem is especially concerning in 3D vision, as 3D data are harder to acquire and annotate. To overcome this problem, previous works modify the architecture of transformers to incorporate inductive biases by applying, e.g., local attention and down-sampling. Although they have achieved promising results, earlier works on transformers for point clouds have two issues. First, the power of plain transformers is still under-explored. Second, they focus on simple and small point clouds instead of complex real-world ones. This work revisits the plain transformers in real-world point cloud understanding. We first take a closer look at some fundamental components of plain transformers, e.g., patchifier and positional embedding, for both efficiency and performance. To close the performance gap due to the lack of inductive bias and annotated data, we investigate self-supervised pre-training with masked autoencoder (MAE). Specifically, we propose drop patch, which prevents information leakage and significantly improves the effectiveness of MAE. Our models achieve SOTA results in semantic segmentation on the S3DIS dataset and object detection on the ScanNet dataset with lower computational costs. Our work provides a new baseline for future research on transformers for point clouds.
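The citation statement for this entry notes that Point-MAE reconstructs the coordinates of masked points. One common way to score a predicted point set against a ground-truth one is a Chamfer-style distance, sketched below in NumPy. The statement itself does not name a loss, so treat the choice of Chamfer distance, the squared-L2 variant, and the function name `chamfer_distance` as assumptions for illustration rather than a description of any cited method.

```python
import numpy as np


def chamfer_distance(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3).

    Uses squared Euclidean distances, averaged over each set (an assumed
    variant; other normalizations exist).
    """
    # Pairwise squared distances, shape (N, M).
    d = np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=-1)
    # Nearest neighbor in b for each point of a, and vice versa.
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())


# Toy example: two points predicted, two ground-truth points.
pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
target = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 1.0]])
cd = chamfer_distance(pred, target)
```

Because each point is matched to its nearest neighbor in the other set, the measure does not depend on the ordering of points, which suits unordered point clouds.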
