2023
DOI: 10.48550/arxiv.2301.05339
Preprint

A Comprehensive Review of Data-Driven Co-Speech Gesture Generation

Abstract: Figure 1: Co-speech gesture generation approaches can be divided into rule-based and data-driven. Rule-based systems use carefully designed heuristics to associate speech with gesture (Section 4). Data-driven approaches associate speech and gesture through statistical modeling (Section 5.2), or by learning multimodal representations using deep generative models (Section 5.3). The main input modalities are speech audio in an intermediate representation; text transcript of speech; humanoid pose in joint position…
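To make the data-driven setting of the survey concrete, below is a minimal, hypothetical sketch of the core mapping it covers: a model that turns frame-aligned speech-audio features into per-frame humanoid poses. All names, feature sizes, and pose dimensions are illustrative assumptions, not values from the survey or any cited dataset.

    # Minimal sketch of a data-driven speech-to-gesture mapping (illustrative
    # only; not a method from the survey). Feature and pose sizes are assumed.
    import torch
    import torch.nn as nn

    class AudioToGesture(nn.Module):
        """Maps frame-aligned audio features to per-frame pose vectors."""
        def __init__(self, audio_dim=26, pose_dim=81, hidden=256):
            super().__init__()
            self.encoder = nn.GRU(audio_dim, hidden, batch_first=True)
            self.decoder = nn.Linear(hidden, pose_dim)

        def forward(self, audio):            # audio: (batch, frames, audio_dim)
            states, _ = self.encoder(audio)  # per-frame hidden states
            return self.decoder(states)      # (batch, frames, pose_dim)

    model = AudioToGesture()
    mfcc = torch.randn(8, 120, 26)   # 8 clips, 120 frames of MFCC-like features
    poses = model(mfcc)              # predicted pose sequence
    print(poses.shape)               # torch.Size([8, 120, 81])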


Cited by 8 publications (12 citation statements: 0 supporting, 12 mentioning, 0 contrasting)
References 57 publications
“…We found the hand/finger quality of existing mocap datasets [56] is not good enough, especially when retargeted to an avatar. Even datasets that claim high-quality hand motion capture have been reported to have poor hand motion [56], e.g., the ZEGGS dataset in [5] and Talking With Hands in [90]. So we ignore hand/finger motion for now and leave it to future work.…”
Section: Discussion (mentioning)
confidence: 90%
“…We perform training and evaluation on the Trinity [16] and ZEGGS [19] datasets. Even though the data come from motion capture, hand quality is still low [5, 56, 90], so we ignore hand motion for now. The number of joints for the two datasets is then J_A = 26 and J_B = 27, respectively.…”
Section: Experiments, 4.1 Experiments Preparation (mentioning)
confidence: 99%
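Since both citing papers drop hand/finger joints because of low capture quality, a short hedged sketch of that preprocessing step follows. The joint names and array layout here are hypothetical, not the actual Trinity or ZEGGS skeletons.

    # Hedged sketch: exclude hand/finger joints from a motion clip before
    # training, as the citing papers do. Joint names are hypothetical; real
    # BVH skeletons (e.g., Trinity, ZEGGS) use different naming.
    import numpy as np

    def drop_hand_joints(motion, joint_names,
                         hand_prefixes=("LeftHand", "RightHand")):
        """motion: (frames, joints, 3) joint rotations; returns body-only data."""
        keep = [i for i, name in enumerate(joint_names)
                if not name.startswith(hand_prefixes)]
        return motion[:, keep, :], [joint_names[i] for i in keep]

    names = ["Hips", "Spine", "LeftHandIndex1", "RightHandThumb2", "Head"]
    clip = np.random.randn(100, len(names), 3)   # 100 frames, toy skeleton
    body_clip, body_names = drop_hand_joints(clip, names)
    print(body_names)                            # ['Hips', 'Spine', 'Head']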
“…To alleviate the manual effort of designing rules in rule-based methods, data-driven approaches have gradually become predominant in this field. Nyatsanga et al. [2023] offer a thorough survey of these methods. Early data-driven approaches aim to learn mapping rules directly from data through statistical models [Levine et al. 2009, 2010; Neff et al. 2008] and combine them with predefined gesture units for gesture generation.…”
Section: Related Work, 2.1 Co-speech Gesture Synthesis (mentioning)
confidence: 99%
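The early statistical recipe this excerpt describes (learn a speech-to-gesture mapping, then draw from a predefined gesture-unit inventory) can be reduced to a toy nearest-centroid lookup, sketched below. This is an illustrative stand-in under assumed prosodic features, not the actual method of Levine et al. or Neff et al.

    # Toy illustration of the statistical-model recipe quoted above: map speech
    # features to a predefined gesture-unit inventory via nearest centroid.
    # The inventory, features, and centroids are all assumed for illustration.
    import numpy as np

    rng = np.random.default_rng(0)
    unit_centroids = rng.normal(size=(10, 4))  # 10 units, 4 prosodic features
    unit_labels = [f"unit_{i}" for i in range(10)]

    def select_gesture_unit(prosody):
        """Return the gesture unit whose centroid is closest to this frame."""
        dists = np.linalg.norm(unit_centroids - prosody, axis=1)
        return unit_labels[int(np.argmin(dists))]

    frame = rng.normal(size=4)         # e.g., pitch, energy, duration, rate
    print(select_gesture_unit(frame))  # selected unit for this frame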
“…As shown in the analysis of the GENEA Challenge 2022 [15], the generated gestures have exhibited human-likeness superior to motion capture data; nevertheless, enhancing the semantic expressiveness of these gestures remains an area requiring further exploration. Second, deep learning-based methods for gesture synthesis tend to produce averaged results, often failing to generate nuanced and intricate hand movements [16]. Last, they may not adequately address the inherent intention behind gesture expression, which can lead to less plausible or contextually inappropriate gestures.…”
Section: Introduction (mentioning)
confidence: 99%