2021
DOI: 10.48550/arxiv.2111.14819
Preprint

Point-BERT: Pre-training 3D Point Cloud Transformers with Masked Point Modeling

Abstract: We present Point-BERT, a new paradigm for learning Transformers to generalize the concept of BERT [8] to 3D point clouds. Inspired by BERT, we devise a Masked Point Modeling (MPM) task to pre-train point cloud Transformers. Specifically, we first divide a point cloud into several local point patches, and a point cloud Tokenizer with a discrete Variational AutoEncoder (dVAE) is designed to generate discrete point tokens containing meaningful local information. Then, we randomly mask out some patches of input po…
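The abstract outlines a three-step recipe: group the point cloud into local patches, assign each patch a discrete token with a dVAE tokenizer, then mask a subset of patches and train the Transformer to predict the tokens of the masked ones. Below is a minimal PyTorch sketch of that MPM objective; the tokenizer and transformer modules, the mask ratio, and the tensor shapes are illustrative assumptions, not the authors' released implementation.

# Minimal sketch of a Masked Point Modeling (MPM) objective.
# `tokenizer` and `transformer` are hypothetical placeholder modules.
import torch
import torch.nn as nn
import torch.nn.functional as F

def mask_patches(num_patches: int, mask_ratio: float = 0.6) -> torch.Tensor:
    """Return a boolean mask selecting which local patches to hide."""
    num_masked = int(num_patches * mask_ratio)
    perm = torch.randperm(num_patches)
    mask = torch.zeros(num_patches, dtype=torch.bool)
    mask[perm[:num_masked]] = True
    return mask

def mpm_loss(transformer: nn.Module,
             tokenizer: nn.Module,
             patch_embeddings: torch.Tensor,  # (B, G, C) embeddings of G local patches
             patch_points: torch.Tensor       # (B, G, K, 3) raw points of each patch
             ) -> torch.Tensor:
    """Cross-entropy between the Transformer's predictions and the
    dVAE-assigned discrete tokens at the masked positions."""
    B, G, _ = patch_embeddings.shape
    with torch.no_grad():
        target_tokens = tokenizer(patch_points)      # (B, G) discrete token ids (long)
    mask = mask_patches(G)                           # (G,) boolean mask
    corrupted = patch_embeddings.clone()
    corrupted[:, mask] = 0.0                         # stand-in for a learnable [MASK] embedding
    logits = transformer(corrupted)                  # (B, G, vocab_size)
    return F.cross_entropy(logits[:, mask].reshape(-1, logits.size(-1)),
                           target_tokens[:, mask].reshape(-1))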

Cited by 12 publications (36 citation statements)
References 62 publications
“…The series of GPT [37,38,5] and BERT [12] apply masked modeling to natural language processing and achieve an extraordinary performance boost on downstream tasks with fine-tuning. Inspired by this, BEiT [4] proposes to match image patches with discrete tokens via dVAE [39] and pre-train a standard vision transformer [14,59] by masked image modeling. On top of that, MAE [20] directly reconstructs the raw pixel values of masked tokens and achieves great efficiency with a high mask ratio.…”
Section: Related Work
confidence: 99%
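The quoted passage contrasts BEiT's discrete-token targets with MAE's direct regression of raw pixel values under a high mask ratio. Below is a minimal PyTorch sketch of that MAE-style recipe (random masking at a 75% ratio, mean-squared error on masked patches only); the tensor shapes and the ratio are illustrative assumptions rather than any paper's exact configuration.

# Minimal sketch of MAE-style random masking and masked-patch regression.
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Keep only a (1 - mask_ratio) subset of patch tokens, per sample."""
    B, N, C = tokens.shape
    num_keep = int(N * (1.0 - mask_ratio))
    noise = torch.rand(B, N, device=tokens.device)
    ids_shuffle = noise.argsort(dim=1)                 # random permutation per sample
    ids_keep = ids_shuffle[:, :num_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, C))
    mask = torch.ones(B, N, device=tokens.device)
    mask.scatter_(1, ids_keep, 0.0)                    # 1 = masked, 0 = visible
    return visible, mask

def mae_reconstruction_loss(pred: torch.Tensor,        # (B, N, D) predicted raw patch values
                            target: torch.Tensor,      # (B, N, D) ground-truth raw patch values
                            mask: torch.Tensor         # (B, N) 1 = masked
                            ) -> torch.Tensor:
    """Mean squared error computed only on the masked patches."""
    loss = (pred - target).pow(2).mean(dim=-1)         # (B, N) per-patch error
    return (loss * mask).sum() / mask.sum().clamp(min=1)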
“…For self-supervised pre-training on 3D point clouds, masked autoencoding has not been widely adopted. Similar to BEiT, Point-BERT [59] utilizes dVAE to map 3D patches to tokens for masked point modeling, but heavily relies on contrastive learning [21], complicated data augmentation, and the costly two-stage pre-training. In contrast, our Point-M2AE is a pure masked autoencoding method with one-stage pre-training, and follows MAE to reconstruct the input signals without dVAE mapping.…”
Section: Related Work
confidence: 99%
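The quote distinguishes Point-BERT's dVAE-token targets from a pure masked-autoencoding scheme that regresses the masked input points directly. A common way to score such a coordinate reconstruction is the symmetric Chamfer distance; the sketch below is an illustrative assumption of that loss, not code from Point-M2AE or Point-BERT.

# Minimal sketch of a symmetric Chamfer distance between point sets.
import torch

def chamfer_distance(pred: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """pred:   (B, N, 3) reconstructed points of masked patches
    target: (B, M, 3) ground-truth points of those patches"""
    diff = pred.unsqueeze(2) - target.unsqueeze(1)     # (B, N, M, 3) pairwise differences
    dist = diff.pow(2).sum(dim=-1)                     # (B, N, M) squared distances
    pred_to_target = dist.min(dim=2).values.mean(dim=1)  # nearest target for each prediction
    target_to_pred = dist.min(dim=1).values.mean(dim=1)  # nearest prediction for each target
    return (pred_to_target + target_to_pred).mean()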
“…Since our main goal is not to develop a general backbone for point clouds, we simply work with a standard transformer for shape autoencoding (first-stage training). Similarly, both PointBERT [52] and PointMAE [28] use standard transformers for point cloud self-supervised learning.…”
Section: Neural Shape Representations
confidence: 99%
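For context, a "standard transformer" over point-patch embeddings can be assembled from stock PyTorch modules, as in the sketch below; the width, depth, and head count are illustrative assumptions rather than the configurations used by PointBERT or PointMAE.

# Minimal sketch: a plain Transformer encoder applied to point-patch embeddings.
import torch
import torch.nn as nn

encoder_layer = nn.TransformerEncoderLayer(d_model=384, nhead=6,
                                           dim_feedforward=1536,
                                           batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=12)

patch_embeddings = torch.randn(2, 64, 384)   # (batch, patches, channels)
features = encoder(patch_embeddings)         # (2, 64, 384) contextualized patch features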