CLIP-Mesh: Generating textured meshes from text using pretrained image-text models

Khalid, Nasir Mohammad; Xie, Tianhao; Belilovsky, Eugene; Popa, Tiberiu

doi:10.48550/arxiv.2203.13333

Cited by 3 publications

(4 citation statements)

References 9 publications


“…However, this approach necessitates 3D assets in voxel representations during training, which poses a challenge to scalability with data. Two recent works, namely DreamField (Jain et al 2022b) and CLIP-mesh (Khalid et al 2022b), address the issue of training data by utilizing a pretrained image-text model (Radford et al 2021a) to optimize the underlying 3D representations (NeRFs and meshes) such that all 2D renderings achieve high text-image alignment scores. In the realm of 3D synthesis, relying exclusively on pre-trained large-scale image-text models instead of costly 3D training data has become a popular methodology.…”

Section: Related Work (mentioning)

confidence: 99%
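The optimization idea in the statement above (adjust a 3D representation so that its 2D renderings score well against the text) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: `toy_render_embedding` is a hypothetical stand-in for "render a view, then embed it with the image encoder", the three-dimensional vectors replace real image-text embeddings, and finite-difference gradient ascent replaces the differentiable rendering used in practice. It is not the method of either cited paper.

```python
import math


def cosine(a, b):
    """Cosine similarity, the usual text-image alignment score."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def toy_render_embedding(params, view):
    # Hypothetical stand-in: each "view" rescales the shape parameters.
    # A real pipeline would render an image and pass it to an image encoder.
    return [p * s for p, s in zip(params, view)]


def mean_alignment(params, views, text_emb):
    """Average alignment score over all sampled views."""
    scores = [cosine(toy_render_embedding(params, v), text_emb) for v in views]
    return sum(scores) / len(scores)


def optimize(params, views, text_emb, steps=200, lr=0.1, eps=1e-5):
    """Gradient ascent on the mean alignment, using finite differences."""
    params = list(params)
    for _ in range(steps):
        base = mean_alignment(params, views, text_emb)
        grad = []
        for i in range(len(params)):
            bumped = list(params)
            bumped[i] += eps
            grad.append((mean_alignment(bumped, views, text_emb) - base) / eps)
        params = [p + lr * g for p, g in zip(params, grad)]
    return params


text_emb = [0.6, 0.8, 0.0]                      # toy "text embedding"
views = [[1.0, 1.0, 1.0], [1.2, 0.9, 1.0], [0.8, 1.1, 1.0]]
start = [1.0, -0.5, 0.7]                        # toy initial shape parameters

before = mean_alignment(start, views, text_emb)
result = optimize(start, views, text_emb)
after = mean_alignment(result, views, text_emb)
print(f"alignment before: {before:.3f}, after: {after:.3f}")
```

The point of the sketch is only the structure of the loop: the score is averaged over several views, and only the shape parameters are updated while the embedding model stays fixed.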

“…The smoothed versions of the data distribution can be obtained by integrating out the data density $q(x)$ to compute the marginals $q(z_t) = \int q(z_t \mid x)\, q(x)\, dx$. To ensure $q(z_t)$ is close to the data density at the start of the process ($\sigma_0 \sim 0$) and close to Gaussian at the end of the forward process ($\sigma_T \sim 1$), the coefficients $\alpha_t$ and $\sigma_t$ are chosen with $\alpha_t^2 = 1 - \sigma_t^2$ to preserve variance (Kingma et al 2021; Song et al 2020).…”

Section: Introduction (mentioning)

confidence: 99%
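The variance-preserving choice in the statement above can be checked numerically. The sketch below assumes the common Gaussian forward kernel, in which a noised sample is α_t·x + σ_t·ε with ε drawn from a standard normal; the quoted text does not spell this form out, so treat it as an assumption. The toy data (values of ±1, roughly zero mean and unit variance) and the function names are illustrative only.

```python
import math
import random


def vp_coefficients(sigma_t):
    """Return (alpha_t, sigma_t) satisfying alpha_t^2 = 1 - sigma_t^2."""
    alpha_t = math.sqrt(1.0 - sigma_t ** 2)
    return alpha_t, sigma_t


def forward_sample(x, sigma_t, rng):
    """Draw z_t = alpha_t * x + sigma_t * eps (assumed Gaussian kernel)."""
    alpha_t, sigma_t = vp_coefficients(sigma_t)
    return alpha_t * x + sigma_t * rng.gauss(0.0, 1.0)


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


rng = random.Random(0)
# Toy data with approximately zero mean and unit variance.
data = [rng.choice([-1.0, 1.0]) for _ in range(20000)]

variances = {}
for sigma_t in (0.01, 0.5, 0.99):
    z = [forward_sample(x, sigma_t, rng) for x in data]
    variances[sigma_t] = variance(z)
    print(f"sigma_t={sigma_t}: variance of z_t = {variances[sigma_t]:.3f}")
```

For unit-variance data the variance of the noised samples is α_t² + σ_t², which the constraint pins to 1 at every noise level, so the printed values should stay close to 1 whether σ_t is near 0 or near 1.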


Real3D: The Curious Case of Neural Scene Degeneration

Chen, Hu, Wei et al. 2024

AAAI


Despite significant progress in utilizing pre-trained text-to-image diffusion models to guide the creation of 3D scenes, these methods often struggle to generate scenes that are sufficiently realistic, leading to "neural scene degeneration". In this work, we propose a new 3D scene generation model called Real3D. Specifically, Real3D designs a pipeline from a NeRF-like implicit renderer to a tetrahedrons-based explicit renderer, greatly improving the neural network's ability to generate various neural scenes. Moreover, Real3D introduces an additional discriminator to prevent neural scenes from falling into undesirable local optima, thus avoiding the degeneration phenomenon. Our experimental results demonstrate that Real3D outperforms all existing state-of-the-art text-to-3D generation methods, providing valuable insights to facilitate the development of learning-based 3D scene generation approaches.


“…NeRFs are neural representations of scenes that are entirely contained in the network, with no explicit graphics resources required. There has been a small amount of work done on explicit generation from text, where the generated content is separate from the model that created it [Khalid et al(2022)] [Chen et al(2023a)]. In particular there is Point-E [Nichol et al(2022)], a 3D point cloud generation technique that is conditioned on text prompts.…”

Section: Generation From Text (mentioning)

confidence: 99%

Generating Parametric BRDFs from Natural Language Descriptions

Memery, Cedron, Subr 2023

Computer Graphics Forum


Artistic authoring of 3D environments is a laborious enterprise that also requires skilled content creators. There have been impressive improvements in using machine learning to address different aspects of generating 3D content, such as generating meshes, arranging geometry, synthesizing textures, etc. In this paper we develop a model to generate Bidirectional Reflectance Distribution Functions (BRDFs) from descriptive textual prompts. BRDFs are four dimensional probability distributions that characterize the interaction of light with surface materials. They are either represented parametrically, or by tabulating the probability density associated with every pair of incident and outgoing angles. The former lends itself to artistic editing while the latter is used when measuring the appearance of real materials. Numerous works have focused on hypothesizing BRDF models from images of materials. We learn a mapping from textual descriptions of materials to parametric BRDFs. Our model is first trained using a semi‐supervised approach before being tuned via an unsupervised scheme. Although our model is general, in this paper we specifically generate parameters for MDL materials, conditioned on natural language descriptions, within NVIDIA's Omniverse platform. This enables use cases such as real‐time text prompts to change materials of objects in 3D environments such as “dull plastic” or “shiny iron”. Since the output of our model is a parametric BRDF, rather than an image of the material, it may be used to render materials using any shape under arbitrarily specified viewing and lighting conditions.


“…Besides, prompt-based editing can be done through finetuning with LDM in the coarse-to-fine stage with the modified prompt. Inaccurate and unfaithful structures in text-to-3D generation due to random shape initialization without prior knowledge lead Dream3D [146] to introduce explicit 3D shape priors into the CLIP-guided 3D optimization process [143,147,148]. Specifically, it connects the T2I model and a shape generator as the text-to-shape stage to produce a 3D shape prior with shape components in the prompts.…”

Section: Generating Data In Target Domain (mentioning)

confidence: 99%

Are Vision Transformers Robust to Patch Perturbations?

Gu, Tresp, Qin 2022

Lecture Notes in Computer Science


Prompt engineering is a technique that involves augmenting a large pre-trained model with task-specific hints, known as prompts, to adapt the model to new tasks. Prompts can be created manually as natural language instructions or generated automatically as either natural language instructions or vector representations. Prompt engineering enables the ability to perform predictions based solely on prompts without updating model parameters, and the easier application of large pre-trained models in real-world tasks. In past years, Prompt engineering has been well-studied in natural language processing. Recently, it has also been intensively studied in vision-language modeling. However, there is currently a lack of a systematic overview of prompt engineering on pre-trained vision-language models. This paper aims to provide a comprehensive survey of cutting-edge research in prompt engineering on three types of vision-language models: multimodal-to-text generation models (e.g., Flamingo), image-text matching models (e.g., CLIP), and text-to-image generation models (e.g., Stable Diffusion). For each type of model, a brief model summary, prompting methods, prompting-based applications, and the corresponding responsibility and integrity issues are summarized and discussed. Furthermore, the commonalities and differences between prompting on vision-language models, language models, and vision models are also discussed. The challenges, future directions, and research opportunities are summarized to foster future research on this topic.

