2022
DOI: 10.1145/3528223.3530094

AvatarCLIP

Abstract: 3D avatar creation plays a crucial role in the digital age. However, the whole production process is prohibitively time-consuming and labor-intensive. To democratize this technology to a larger audience, we propose AvatarCLIP, a zero-shot text-driven framework for 3D avatar generation and animation. Unlike professional software that requires expert knowledge, AvatarCLIP empowers layman users to customize a 3D avatar with the desired shape and texture, and drive the avatar with the described motions using solely…
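
For context on what "zero-shot text-driven" means in practice, the following is a minimal sketch (not the authors' released code) of the kind of CLIP-based scoring such a pipeline can optimize: a rendered avatar image and a text prompt are embedded with CLIP, and their cosine similarity serves as the supervision signal. The model variant and the `render_avatar` renderer are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): scoring a rendered avatar image
# against a text prompt with CLIP, the kind of zero-shot objective a
# text-driven avatar pipeline can optimize. "render_avatar" is a hypothetical
# placeholder for a (differentiable) renderer.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # model choice is an assumption

def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP embeddings of an image and a prompt."""
    image_input = preprocess(image).unsqueeze(0).to(device)
    text_input = clip.tokenize([prompt]).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image_input)
        text_feat = model.encode_text(text_input)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    return (image_feat @ text_feat.T).item()

# Example (hypothetical renderer):
# score = clip_score(render_avatar(avatar_params), "a tall and skinny male wizard")
```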

Cited by 144 publications (29 citation statements)
References 46 publications
“…In addition to image-to-text conversion, Hong et al designed a zero-shot text-driven 3D avatar generation and animation framework, named Avatar Contrastive Language-Image Pre-Training (AvatarCLIP). [110] As shown in Figure 3e, AvatarCLIP can create a customized 3D avatar following the user's desired shape and texture and make the avatar follow motions described in text. Specifically, the generated 3D human geometry is initialized from shapes driven by natural language descriptions through a Variational Autoencoder (VAE) network.…”
Section: Advanced Image Sensors (mentioning)
confidence: 99%
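
The statement above summarizes how the body shape is initialized from a VAE latent under text guidance. The sketch below shows one plausible form of that step, optimizing a latent code so that renders of the decoded body match the prompt under CLIP; `shape_vae_decoder`, `render_views`, and `clip_image_encoder` are hypothetical placeholders, not the paper's actual modules.

```python
# Hedged sketch of CLIP-guided shape initialization as described above:
# optimize a VAE latent code so renders of the decoded body shape match a
# text prompt under CLIP. All module arguments are hypothetical stand-ins.
import torch

def init_shape_from_text(prompt_feat, shape_vae_decoder, render_views,
                         clip_image_encoder, steps=200, lr=0.05, latent_dim=32):
    # prompt_feat: pre-computed, L2-normalized CLIP text feature of shape (1, D)
    z = torch.zeros(1, latent_dim, requires_grad=True)   # VAE latent for body shape
    optim = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        body = shape_vae_decoder(z)                       # coarse 3D body shape
        images = render_views(body)                       # differentiable multi-view renders
        img_feat = clip_image_encoder(images)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        loss = 1.0 - (img_feat @ prompt_feat.T).mean()    # maximize CLIP similarity
        optim.zero_grad()
        loss.backward()
        optim.step()
    return z.detach()
```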
“…Diffusion generative models have achieved impressive success in a wide variety of computer vision tasks such as image inpainting [31], text-to-image generation [30], and image-to-image translation [4]. Given their strong capability to bridge the large gap between highly uncertain and determinate distributions, several works have utilized diffusion generative models for text-to-motion generation [43,33,42]. Zhang et al [42] propose a versatile motion-generation framework that incorporates a diffusion model to generate diverse motions from comprehensive texts.…”
Section: Diffusion Generative Models (mentioning)
confidence: 99%
“…Given their strong capability to bridge the large gap between highly uncertain and determinate distributions, several works have utilized diffusion generative models for text-to-motion generation [43,33,42]. Zhang et al [42] propose a versatile motion-generation framework that incorporates a diffusion model to generate diverse motions from comprehensive texts. Similarly, Tevet et al [33] introduce a lightweight transformer-based diffusion generative model that can achieve text-to-motion generation and motion editing.…”
Section: Diffusion Generative Models (mentioning)
confidence: 99%
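
Both statements above describe diffusion models that map text to a motion sequence. As a rough illustration of the sampling side of such a model (not any cited paper's implementation), the loop below runs DDPM-style ancestral sampling over a pose sequence, conditioning a learned noise predictor `eps_model` on a text embedding; the noise schedule and dimensions are assumptions.

```python
# Illustrative DDPM-style sampling for text-conditioned motion generation:
# start from Gaussian noise over a pose sequence and iteratively denoise
# with a learned network `eps_model(x_t, t, text_emb)` (hypothetical).
import torch

def sample_motion(eps_model, text_emb, seq_len=60, pose_dim=72, T=1000):
    betas = torch.linspace(1e-4, 0.02, T)             # linear noise schedule (assumed)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, seq_len, pose_dim)             # x_T: pure noise over the sequence
    for t in reversed(range(T)):
        eps = eps_model(x, torch.tensor([t]), text_emb)        # predicted noise
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise                # x_{t-1}
    return x                                                    # denoised pose sequence
```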
“…In contrast, our work uses a transformer architecture to learn temporal correlations over sequences of shapes. Transformers for generating sequences of human bodies have also been recently explored by Song et al [SWJ*22], who concentrate on a multi-person skeleton generation use case, and by Hong et al [HZP*22] for the generation of human body animations from text input. The recent work by Petrovich et al [PBV21] is closest in spirit to ours: it introduces Actor, a transformer variational autoencoder for action-conditioned generation of human body poses.…”
Section: Related Work (mentioning)
confidence: 99%
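
As a companion to the transformer-VAE idea mentioned in this statement, here is a rough ACTOR-style sketch, not the cited implementation: a transformer encoder reads a pose sequence together with two learned distribution tokens to produce a latent Gaussian, and a transformer decoder regenerates a fixed-length sequence from the sampled latent via learned time queries. All sizes are illustrative.

```python
# Rough sketch, in the spirit of the transformer-VAE idea above (ACTOR-like).
# Sizes and architecture details are illustrative assumptions.
import torch
import torch.nn as nn

class PoseSeqVAE(nn.Module):
    def __init__(self, pose_dim=72, d_model=256, nhead=4, nlayers=4, max_len=60):
        super().__init__()
        self.in_proj = nn.Linear(pose_dim, d_model)
        self.mu_tok = nn.Parameter(torch.randn(1, 1, d_model))     # learned "mu" token
        self.sigma_tok = nn.Parameter(torch.randn(1, 1, d_model))  # learned "sigma" token
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, nlayers)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, nlayers)
        self.queries = nn.Parameter(torch.randn(1, max_len, d_model))  # time queries
        self.out_proj = nn.Linear(d_model, pose_dim)

    def forward(self, poses):                      # poses: (B, T, pose_dim)
        B = poses.size(0)
        x = self.in_proj(poses)
        toks = torch.cat([self.mu_tok.expand(B, -1, -1),
                          self.sigma_tok.expand(B, -1, -1), x], dim=1)
        h = self.encoder(toks)
        mu, logvar = h[:, 0], h[:, 1]
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)    # reparameterization
        out = self.decoder(self.queries.expand(B, -1, -1), z.unsqueeze(1))
        return self.out_proj(out), mu, logvar
```

At generation time the decoder is driven only by a sampled latent (optionally combined with an action or text embedding), which is how action-conditioned sequences would be produced in this kind of design.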