2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr52688.2022.01805

CLIP-Forge: Towards Zero-Shot Text-to-Shape Generation

Abstract: CLIP-Forge is a zero-shot text-to-shape generation method. It avoids the need for paired text-3D training data by leveraging the pretrained CLIP model, and uses a two-stage training scheme: a shape autoencoder is trained first, followed by a conditional normalizing flow over its latent space.

Cited by 138 publications (38 citation statements)
References 58 publications
“…The fully-supervised method [6,8,25] uses ground truth text and the paired 3D objects with explicit 3D representations as training data. Specifically, CLIP-Forge [43] uses a two-stage training scheme, which consists of shape autoencoder training, and conditional normalizing flow training. VQ-VAE [44] performs zero-shot training with 3D voxel data by utilizing the pretrained CLIP model [37].…”
Section: Text-Guided 3D Shape Generation
confidence: 99%
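The two-stage scheme described above (a shape autoencoder, then a conditional normalizing flow over its latent space) can be sketched in miniature. The sketch below is an illustrative assumption, not the paper's actual architecture: it shows a single conditional affine flow layer in NumPy, where a tiny linear "hypernetwork" maps a condition vector (standing in for a CLIP embedding) to the scale and shift of an exactly invertible transform of a shape latent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins: 8-D shape latents, 4-D "CLIP" condition vectors.
LATENT_DIM, COND_DIM = 8, 4

# Tiny linear "hypernetwork": condition -> per-dimension log-scale and shift.
W_s = rng.normal(scale=0.1, size=(COND_DIM, LATENT_DIM))
W_t = rng.normal(scale=0.1, size=(COND_DIM, LATENT_DIM))

def flow_forward(x, c):
    """Map a shape latent x to base space z, conditioned on c."""
    log_s, t = c @ W_s, c @ W_t
    z = (x - t) * np.exp(-log_s)
    log_det = -log_s.sum(axis=-1)  # log |det dz/dx|, needed for the flow loss
    return z, log_det

def flow_inverse(z, c):
    """Generate a shape latent from base noise z given condition c."""
    log_s, t = c @ W_s, c @ W_t
    return z * np.exp(log_s) + t

x = rng.normal(size=(2, LATENT_DIM))   # latents from a (pretrained) shape autoencoder
c = rng.normal(size=(2, COND_DIM))     # text/image condition embeddings
z, _ = flow_forward(x, c)
x_rec = flow_inverse(z, c)
print(np.allclose(x, x_rec))  # True: the layer is exactly invertible
```

At generation time, one would sample z from a standard normal, condition on a CLIP text embedding, run the inverse pass, and decode the resulting latent with the shape autoencoder's decoder.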
“…To demonstrate our 3D generation quality, we compare existing works [25,43,50] from three aspects: 1) rendered 2D image quality using FID, 2) text-image relevance using R-Precision, and 3) 3D geometry quality using FPD. In Tab.…”
Section: Quantitative Comparison
confidence: 99%
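Of the three metrics named above, R-Precision is the simplest to illustrate: rank candidate texts by embedding similarity to a rendered image and check whether the ground-truth caption lands in the top R. The embeddings and the choice R=1 below are toy assumptions, not the paper's evaluation setup.

```python
import numpy as np

def r_precision(image_emb, text_embs, true_idx, r=1):
    """Is the ground-truth caption among the top-r texts by
    cosine similarity to the rendered image's embedding?"""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img                   # cosine similarity per candidate text
    top_r = np.argsort(-sims)[:r]      # indices of the r most similar texts
    return true_idx in top_r

# Toy embeddings: text 0 is nearly parallel to the image embedding.
image = np.array([1.0, 0.0, 0.0])
texts = np.array([[0.9, 0.1, 0.0],    # ground-truth caption
                  [0.0, 1.0, 0.0],
                  [0.0, 0.0, 1.0]])
print(r_precision(image, texts, true_idx=0))  # True
```

In a real evaluation this check is averaged over many generated shapes, with the similarity computed in CLIP's joint text-image embedding space.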
“…It encodes text and 3D shapes separately into the same latent space. However, large-scale 3D-text datasets are still difficult to obtain, so CLIP-Forge [22] bypasses this problem with the aid of the CLIP model's text-image matching. CLIP-Mesh [23] also uses the CLIP model to measure the matching degree between the text and the image rendered from the mesh model, so as to optimize the entire model's parameters.…”
Section: Introduction
confidence: 99%
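The CLIP-Mesh-style loop described above optimizes model parameters so that the embedding of a rendering matches the text embedding. The following is a heavily simplified stand-in, purely to illustrate gradient ascent on a cosine-similarity matching score: a fixed linear map plays the role of "differentiable render + CLIP image encoder", and the parameter vector stands in for mesh parameters. None of these names or shapes come from the cited papers.

```python
import numpy as np

rng = np.random.default_rng(1)

A = rng.normal(size=(3, 5))        # stand-in for differentiable render + image encoder
text_emb = np.array([1.0, 0.0, 0.0])  # stand-in for the CLIP text embedding
params = rng.normal(size=5)        # stand-in for mesh parameters

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def grad_cosine_wrt_params(p):
    # Analytic gradient of cos(A p, text_emb) with respect to p.
    img = A @ p
    n_i, n_t = np.linalg.norm(img), np.linalg.norm(text_emb)
    g_img = text_emb / (n_i * n_t) - (img @ text_emb) * img / (n_i**3 * n_t)
    return A.T @ g_img

before = cosine(A @ params, text_emb)
for _ in range(200):               # gradient ascent on the matching score
    params += 0.1 * grad_cosine_wrt_params(params)
after = cosine(A @ params, text_emb)
print(after > before)  # True: similarity to the text embedding increased
```

In the actual method, automatic differentiation through a differentiable renderer and the CLIP image encoder replaces the hand-derived gradient here.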
“…An increasingly growing area of research is the creation of 3D models of objects from text descriptions. Depending on the approach, these objects are generated as point clouds (Yang et al., 2019; Achlioptas et al., 2018), voxels (Sanghi et al., 2022; Chen et al., 2018), meshes (Nash et al., 2020), or implicit representations (Mescheder et al., 2019). Alternatively, the most suitable object can be retrieved from an object database (text-to-shape retrieval; Ruan et al., 2022).…”
Section: Text-to-Shape Generation
confidence: 99%