International Conference on Multimodal Interaction 2023
DOI: 10.1145/3577190.3616117

Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation

Anna Deichler,
Shivam Mehta,
Simon Alexanderson
et al.

Abstract: This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. Our solution builds on an existing diffusion-based motion synthesis model. We propose a contrastive speech and motion pretraining (CSMP) module, which learns a joint embedding for speech and gesture with the aim to learn a semantic coupling between these modalities. The output of the CSMP module is used as a conditioning signal in the diffusion-based gesture synthesis mo…
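The abstract's description of CSMP suggests a CLIP-style contrastive objective over paired speech and motion clips. The sketch below illustrates one plausible reading of that idea; the encoder layers, feature dimensions, temporal pooling, and symmetric InfoNCE loss are assumptions made for illustration, not the authors' published implementation.

```python
# Hedged sketch of contrastive speech-and-motion pretraining in the spirit of
# the CSMP module described in the abstract. Module names, dimensions, and the
# loss formulation are illustrative assumptions, not the paper's exact design.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeechMotionEmbedder(nn.Module):
    def __init__(self, speech_dim=768, motion_dim=165, embed_dim=256):
        super().__init__()
        # Hypothetical projection heads; the real system may build on
        # pretrained speech/text features and a dedicated motion encoder.
        self.speech_proj = nn.Sequential(
            nn.Linear(speech_dim, embed_dim), nn.GELU(), nn.Linear(embed_dim, embed_dim)
        )
        self.motion_proj = nn.Sequential(
            nn.Linear(motion_dim, embed_dim), nn.GELU(), nn.Linear(embed_dim, embed_dim)
        )
        self.logit_scale = nn.Parameter(torch.tensor(2.0))

    def forward(self, speech_feats, motion_feats):
        # Inputs: (batch, time, dim). Pool over time and L2-normalise so the
        # dot product between embeddings is a cosine similarity.
        s = F.normalize(self.speech_proj(speech_feats.mean(dim=1)), dim=-1)
        m = F.normalize(self.motion_proj(motion_feats.mean(dim=1)), dim=-1)
        return s, m

def contrastive_loss(s, m, logit_scale):
    # Symmetric InfoNCE: matching speech/motion pairs lie on the diagonal.
    logits = logit_scale.exp() * s @ m.t()
    targets = torch.arange(s.size(0), device=s.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

Under this reading, the embeddings produced by the trained module would then serve as the conditioning signal for the diffusion-based gesture synthesis model, as the abstract states.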

Cited by 11 publications (1 citation statement) · References 36 publications
“…For example, gesture generation approaches use speech as a condition along with text for style; the attention mechanism is leveraged to synchronise the gestures to the speech. The approach of Deichler et al [DMAB23] proposes a contrastive speech and motion pre‐training module that learns joint embedding of speech and gesture; it learns a semantic coupling between these modalities. DiffuseStyleGesture [YWL*23] is an audio‐driven co‐gesture generation approach that synthesises gestures matching the music rhythm and text descriptions based on cross‐local and self‐attention mechanisms.…”
Section: Towards 4D Spatio-temporal Diffusion
Mentioning confidence: 99%
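The quoted survey text notes that attention is used to synchronise generated gestures with the speech that conditions them. Purely as a rough illustration of that mechanism, the block below shows how a diffusion denoiser might let noisy motion frames cross-attend to a conditioning sequence (for example, CSMP-style embeddings); the layer sizes, single-block structure, and interfaces are hypothetical and are not taken from the cited papers.

```python
# Hedged sketch of cross-attention conditioning inside a diffusion denoiser,
# illustrating the attention-based speech-to-gesture synchronisation the
# citation statement describes. Shapes and layers are assumptions.
import torch
import torch.nn as nn

class ConditionedDenoiserBlock(nn.Module):
    def __init__(self, motion_dim=165, cond_dim=256, model_dim=256, heads=4):
        super().__init__()
        self.in_proj = nn.Linear(motion_dim, model_dim)
        self.cond_proj = nn.Linear(cond_dim, model_dim)
        # Cross-attention: noisy motion frames (queries) attend to the
        # conditioning frames (keys/values) to stay aligned with the speech.
        self.cross_attn = nn.MultiheadAttention(model_dim, heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(model_dim, model_dim), nn.GELU(),
                                nn.Linear(model_dim, motion_dim))

    def forward(self, noisy_motion, cond_seq, t_embed):
        # noisy_motion: (B, T, motion_dim); cond_seq: (B, T_c, cond_dim);
        # t_embed: (B, model_dim) diffusion-step embedding, broadcast over time.
        x = self.in_proj(noisy_motion) + t_embed.unsqueeze(1)
        c = self.cond_proj(cond_seq)
        attn_out, _ = self.cross_attn(query=x, key=c, value=c)
        # Predict the per-frame denoising target for this diffusion step.
        return self.ff(x + attn_out)
```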