Creating stylized 3D avatars and portraits from a single input image is an emerging challenge in augmented and virtual reality. While prior work has explored 2D stylization or 3D avatar generation, achieving high-fidelity 3D stylized portraits with text control remains an open problem. In this paper, we present an efficient approach for generating high-quality 3D stylized portraits directly from a single input image. Our core representation is 3D Gaussian Splatting, which enables efficient rendering and is combined with a surface-guided splitting and cloning strategy to reduce noise. To achieve high-fidelity stylized results, we introduce a Stylized Generation Module with a Style-Aligned Sampling Loss that injects the input image's identity information into the diffusion model while stabilizing the stylization process. Furthermore, we incorporate a multi-view diffusion model that enforces 3D consistency by generating multiple viewpoints. Extensive experiments demonstrate that our approach outperforms existing methods in stylization quality, 3D consistency, and user preference. Our framework enables casual users to easily generate stylized 3D portraits from simple image or text inputs, facilitating engaging experiences in AR/VR applications.