2022
DOI: 10.48550/arxiv.2203.13333
Preprint

CLIP-Mesh: Generating textured meshes from text using pretrained image-text models

Nasir Mohammad Khalid,
Tianhao Xie,
Eugene Belilovsky
et al.

Abstract: Figure 1. A 3D scene composed of objects generated using only text prompts: lamp shade, round brown table, photograph of a bust of homer, vase with pink flowers, blue sofa, pink pillow, painting in a frame, brown table, apple, banana, muffin, loaf of bread, coffee, burger, fruit basket, coca cola can, red chair, computer monitor, photo of marios cap, playstation one controller, blue pen, excalibur sword, matte painting of a bonsai tree; trending on artstation.

Cited by 3 publications (4 citation statements)
References 9 publications
“…However, this approach necessitates 3D assets in voxel representations during training, which poses a challenge to scalability with data. Two recent works, namely DreamField (Jain et al 2022b) and CLIP-mesh (Khalid et al 2022b), address the issue of training data by utilizing a pretrained image-text model (Radford et al 2021a) to optimize the underlying 3D representations (NeRFs and meshes) such that all 2D renderings achieve high text-image alignment scores. In the realm of 3D synthesis, relying exclusively on pre-trained large-scale image-text models instead of costly 3D training data has become a popular methodology.…”
Section: Related Work (mentioning)
Confidence: 99%
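The statement above summarizes the methodology shared by DreamField and CLIP-Mesh: a 3D representation is optimized so that its 2D renderings score highly against a text prompt under a pretrained image-text model such as CLIP. Below is a minimal sketch of that loop in PyTorch; `DifferentiableScene` and `render_random_views` are hypothetical placeholders for the differentiable 3D representation and renderer (a mesh with a differentiable rasterizer in CLIP-Mesh, a NeRF with volume rendering in DreamField), and the actual methods add augmentations, regularizers, and texture/normal optimization not shown here.

```python
import torch
import clip  # OpenAI CLIP (https://github.com/openai/CLIP)

# Hypothetical stand-ins for a differentiable 3D representation and renderer.
from my_scene import DifferentiableScene, render_random_views  # assumed helpers

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

scene = DifferentiableScene().to(device)               # learnable 3D parameters
optimizer = torch.optim.Adam(scene.parameters(), lr=1e-2)

text = clip.tokenize(["a vase with pink flowers"]).to(device)
with torch.no_grad():
    text_feat = model.encode_text(text)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

for step in range(1000):
    # Render a batch of 2D views from random camera poses (assumed to return
    # an N x 3 x 224 x 224 tensor already normalized for CLIP).
    images = render_random_views(scene, n_views=4)

    image_feat = model.encode_image(images)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)

    # Maximize text-image alignment (cosine similarity) of every rendering.
    loss = -(image_feat @ text_feat.T).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```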
“…The smoothed versions of the data distribution can be obtained by integrating out the data density $q(x)$ to compute the marginals $q(z_t) = \int q(z_t \mid x)\, q(x)\, dx$. To ensure $q(z_t)$ is close to the data density at the start of the process ($\sigma_0 \sim 0$) and close to Gaussian at the end of the forward process ($\sigma_T \sim 1$), the coefficients $\alpha_t$ and $\sigma_t$ are chosen with $\alpha_t^2 = 1 - \sigma_t^2$ to preserve variance (Kingma et al. 2021; Song et al. 2020).…”
Section: Introduction (mentioning)
Confidence: 99%
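The quoted passage describes the variance-preserving forward process used in diffusion models, where $\alpha_t^2 = 1 - \sigma_t^2$. A small sketch, assuming the standard Gaussian conditional $q(z_t \mid x) = \mathcal{N}(\alpha_t x, \sigma_t^2 I)$ from Kingma et al. (2021) and Song et al. (2020), illustrating that the marginal keeps unit variance as it interpolates between the data and a standard Gaussian:

```python
import torch

def forward_marginal(x, sigma_t):
    """Sample z_t ~ q(z_t | x) = N(alpha_t * x, sigma_t^2 I) with the
    variance-preserving choice alpha_t^2 = 1 - sigma_t^2."""
    alpha_t = torch.sqrt(1.0 - sigma_t ** 2)
    eps = torch.randn_like(x)
    return alpha_t * x + sigma_t * eps

# Sanity check: if x has unit variance, z_t keeps unit variance for any sigma_t,
# moving from the data (sigma ~ 0) toward a standard Gaussian (sigma ~ 1).
x = torch.randn(100_000)
for sigma in (0.01, 0.5, 0.99):
    z = forward_marginal(x, torch.tensor(sigma))
    print(f"sigma_t={sigma:.2f}  var(z_t)={z.var().item():.3f}")
```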
“…NeRFs are neural representations of scenes that are entirely contained in the network, with no explicit graphics resources required. There has been a small amount of work done on explicit generation from text, where the generated content is separate from the model that created it [Khalid et al. (2022)], [Chen et al. (2023a)]. In particular, there is Point-E [Nichol et al. (2022)], a 3D point cloud generation technique that is conditioned on text prompts.…”
Section: Generation From Text (mentioning)
Confidence: 99%
“…Besides, prompt-based editing can be done by fine-tuning with LDM in the coarse-to-fine stage using the modified prompt. Inaccurate and unfaithful structures in text-to-3D generation, caused by random shape initialization without prior knowledge, lead Dream3D [146] to introduce explicit 3D shape priors into the CLIP-guided 3D optimization process [143,147,148]. Specifically, it connects the T2I model and a shape generator as a text-to-shape stage to produce a 3D shape prior from the shape components in the prompts.…”
Section: Generating Data In Target Domain (mentioning)
Confidence: 99%