2023
DOI: 10.1145/3592097
GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents

Abstract: The automatic generation of stylized co-speech gestures has recently received increasing attention. Previous systems typically allow style control via predefined text labels or example motion clips, which are often not flexible enough to convey user intent accurately. In this work, we present GestureDiffuCLIP, a neural network framework for synthesizing realistic, stylized co-speech gestures with flexible style control. We leverage the power of the large-scale Contrastive-Language-Image-Pre-training (CLIP) model…

Cited by 64 publications (7 citation statements) · References 57 publications
“…For instance, DALLE [RDN*22] enables users to describe an image in terms of content and style. Similar applications can be observed in the computer animation field such as the Human Motion Diffusion Model (MDM) [TRG*23], Speech Gesture Diffusion [AZL23], etc.…”
Section: Introduction (mentioning)
confidence: 78%
“…Similarly, [8] conditions on CLIP latents, but combines latent space based and diffusion based motion generation. Most similar to our work is [3], which learns a gesture-text joint embedding using contrastive learning and a CLIP based style encoding module in a diffusion based gesture synthesis model.…”
Section: Using Language Based Pre-training Approaches in Motion Gener… (mentioning)
confidence: 99%
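The CLIP-based style encoding described in this excerpt typically supplies a text-derived latent that modulates the generator's intermediate features. Purely as an illustration, here is a minimal PyTorch sketch of FiLM/AdaIN-style feature modulation by a CLIP text latent; the module name, dimensions, and modulation scheme are assumptions chosen for exposition, not the cited paper's exact design.

import torch
import torch.nn as nn

class StyleConditionedBlock(nn.Module):
    # Hypothetical block: modulates denoiser features with a CLIP style latent.
    def __init__(self, feat_dim=256, clip_dim=512):
        super().__init__()
        self.norm = nn.LayerNorm(feat_dim)
        # Predict a per-channel scale and shift from the CLIP latent (assumed sizes).
        self.to_scale_shift = nn.Linear(clip_dim, 2 * feat_dim)

    def forward(self, h, clip_latent):
        scale, shift = self.to_scale_shift(clip_latent).chunk(2, dim=-1)
        return self.norm(h) * (1 + scale) + shift  # feature-wise modulation

block = StyleConditionedBlock()
h = torch.randn(8, 256)       # denoiser hidden features (placeholder)
style = torch.randn(8, 512)   # e.g., CLIP text embedding of a style prompt
out = block(h, style)

The appeal of this pattern is that any style expressible as a CLIP embedding, whether from text, an image, or a motion clip projected into the joint space, can drive the same modulation pathway.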
“…Diffusion models [15,34,35] have emerged as a notable and contemporary probabilistic generative modelling methodology. These models have shown promise in capturing complex data distributions and have gained attention in various fields, including gesture generation [2,3,30,45]. Inspired by these works our system uses Denoising Diffusion Probabilistic Modelling (DDPM) [15] formulation with self-supervised representations to synthesise gestures conditioned on the input audio.…”
Section: Related Work 2.1 Co-speech Gesture Generation (mentioning)
confidence: 99%
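For context, the DDPM formulation referenced in this excerpt trains a network to predict the noise injected by a forward diffusion process. Below is a minimal, self-contained sketch of that noise-prediction objective with an audio conditioning vector; the toy denoiser, feature sizes, and schedule are illustrative assumptions, not the cited system's architecture.

import torch
import torch.nn as nn

T = 1000                                  # diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)     # linear noise schedule (common default)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

class Denoiser(nn.Module):
    # Placeholder epsilon-predictor; real systems use transformers or U-Nets.
    def __init__(self, pose_dim=64, audio_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(pose_dim + audio_dim + 1, 256), nn.SiLU(),
            nn.Linear(256, pose_dim),
        )
    def forward(self, x_t, t, audio):
        t_emb = t.float().unsqueeze(-1) / T           # crude timestep embedding
        return self.net(torch.cat([x_t, audio, t_emb], dim=-1))

def ddpm_loss(model, x0, audio):
    # L_simple = E || eps - eps_theta(sqrt(a_bar) x0 + sqrt(1 - a_bar) eps, t) ||^2
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))
    a_bar = alpha_bars[t].unsqueeze(-1)
    eps = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps  # forward process q(x_t | x0)
    return ((eps - model(x_t, t, audio)) ** 2).mean()

model = Denoiser()
loss = ddpm_loss(model, torch.randn(8, 64), torch.randn(8, 128))

At sampling time the same network is applied iteratively from pure noise, with the audio features held fixed, which is what makes the synthesized gestures speech-conditioned.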
“…CLIP [11] trained language and image information jointly by minimizing the distance between corresponding image and text embeddings, achieving zero-shot performance on downstream tasks. Due to the impressive performance of vision-language models [39], our research focuses on improving the alignment between images and semantic information using these models. The objective is to provide diffusion models with more accurate guidance signals.…”
Section: Vision and Language (VL) Models (mentioning)
confidence: 99%
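The joint training objective this excerpt describes is CLIP's symmetric contrastive loss: within a batch, embeddings of matching image-text pairs are pulled together while all mismatched pairings are pushed apart. A minimal sketch follows, with random tensors standing in for the outputs of real image and text encoders.

import torch
import torch.nn.functional as F

def clip_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric cross-entropy over the cosine-similarity matrix of a batch.
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature    # (N, N) pairwise similarities
    targets = torch.arange(len(logits))             # diagonal entries are true pairs
    loss_i = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i + loss_t) / 2

loss = clip_loss(torch.randn(32, 512), torch.randn(32, 512))

Because both modalities land in one shared embedding space, distances in that space can serve directly as the guidance signal the citing authors mention.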