Proceedings of the 2021 International Conference on Multimodal Interaction (ICMI '21)
DOI: 10.1145/3462244.3479914

Integrated Speech and Gesture Synthesis

Abstract: Text-to-speech and co-speech gesture synthesis have until now been treated as separate areas by two different research communities, and applications merely stack the two technologies using a simple system-level pipeline. This can lead to modeling inefficiencies and may introduce inconsistencies that limit the achievable naturalness. We propose to instead synthesize the two modalities in a single model, a new problem we call integrated speech and gesture synthesis (ISG). We also propose a set of models modified…
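The abstract's core proposal, generating speech audio and gesture motion from one model rather than stacking a TTS system and a separate gesture generator, can be pictured with a short sketch. The following is a hypothetical illustration in PyTorch, not the paper's actual architecture; every class name, layer choice, and dimension is an assumption made for clarity.

```python
# Minimal sketch of the ISG idea: one shared encoder feeds two output
# heads, so speech acoustics and gesture motion come from a single model.
# Illustrative assumption only -- not the architecture from the paper.
import torch
import torch.nn as nn

class ISGModel(nn.Module):
    def __init__(self, vocab=64, hidden=256, n_mels=80, n_joints=45):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        # Both heads read the same encoder states; this shared
        # representation is what a system-level pipeline lacks.
        self.speech_head = nn.Linear(hidden, n_mels)     # mel frames
        self.gesture_head = nn.Linear(hidden, n_joints)  # joint rotations

    def forward(self, tokens):
        states, _ = self.encoder(self.embed(tokens))
        return self.speech_head(states), self.gesture_head(states)

tokens = torch.randint(0, 64, (1, 120))  # dummy phoneme IDs
mels, poses = ISGModel()(tokens)
print(mels.shape, poses.shape)  # (1, 120, 80) and (1, 120, 45)
```

Because both outputs derive from the same hidden states, timing and prosody information is shared between the modalities, which is the inefficiency-and-inconsistency argument the abstract makes against stacking two separate systems.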

Cited by 11 publications (3 citation statements: 0 supporting, 3 mentioning, 0 contrasting)
References: 51 publications

“…Another line of work has considered training verbal (text‐to‐speech) and non‐verbal (speech‐to‐gesture) synthesis systems on the same data [ASH*20] and, subsequently, merging them into one single network that generates both speech audio and gesture motion [WAG*21]. Given the strides that have been made in generating convincing speech audio from text [TQSL21], adapting successful text‐to‐speech methods to simultaneously generate both acoustics and joint rotations, as was done in [WAG*21], seems like a compelling direction for future work. This not only brings advantages in terms of modeling efficiency (the gesture‐generation systems will possess information about, e.g.…”
Section: Key Challenges of Gesture Generation (citation type: mentioning)
confidence: 99%
“…Yoon et al [2019] build a GRU-based model for gesture generation, where the model is trained on the TED dataset. Wang et al [2021a] propose to improve the motion quality by jointly synthesizing speech and gestures from the text in an integrated LSTM architecture. Liu et al [2022c] propose a cascaded LSTM and MLP by integrating emotion, speaker identity, and style features for motion synthesis.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
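The "integrated LSTM architecture" named in the statement above can be illustrated with one hypothetical autoregressive decoding step in which the previous acoustic frame and pose frame are fed back together, so each modality conditions the other. This is a sketch under assumed dimensions, not Wang et al.'s published model.

```python
# Hypothetical joint autoregressive step: the previous mel frame and pose
# frame are concatenated and fed back into one LSTM cell, so speech and
# gesture are generated in lockstep. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

n_mels, n_joints, hidden = 80, 45, 256
cell = nn.LSTMCell(n_mels + n_joints, hidden)
speech_out = nn.Linear(hidden, n_mels)
pose_out = nn.Linear(hidden, n_joints)

prev_mel = torch.zeros(1, n_mels)     # <GO> acoustic frame
prev_pose = torch.zeros(1, n_joints)  # <GO> pose frame
h = torch.zeros(1, hidden)
c = torch.zeros(1, hidden)
frames = []
for _ in range(100):  # decode 100 synchronized speech+gesture frames
    h, c = cell(torch.cat([prev_mel, prev_pose], dim=-1), (h, c))
    prev_mel, prev_pose = speech_out(h), pose_out(h)
    frames.append((prev_mel, prev_pose))
```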
“…Gesture generation is a complex task that requires understanding speech, gestures, and their relationships. The present data-driven studies mainly consider four modalities: text [6,80,89], audio [20,24,62], gesture motion [52,85,88], and speaker identity [3,4,50]. There are some works to extend the scale of the dataset.…”
Section: Gesture Generation (citation type: mentioning)
confidence: 99%