Proceedings of the 2021 International Conference on Multimodal Interaction (ICMI '21)
DOI: 10.1145/3462244.3479914

Integrated Speech and Gesture Synthesis

Abstract: Text-to-speech and co-speech gesture synthesis have until now been treated as separate areas by two different research communities, and applications merely stack the two technologies using a simple system-level pipeline. This can lead to modeling inefficiencies and may introduce inconsistencies that limit the achievable naturalness. We propose to instead synthesize the two modalities in a single model, a new problem we call integrated speech and gesture synthesis (ISG). We also propose a set of models modified…
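The abstract's core proposal, generating speech audio and gesture motion from one model rather than stacking a TTS system and a separate gesture generator, can be pictured with a short sketch. The following is a hypothetical illustration in PyTorch, not the paper's actual architecture; every class name, layer choice, and dimension is an assumption made for clarity.

```python
# Minimal sketch of the ISG idea: one shared encoder feeds two output
# heads, so speech acoustics and gesture motion come from a single model.
# Illustrative assumption only -- not the architecture from the paper.
import torch
import torch.nn as nn

class ISGModel(nn.Module):
    def __init__(self, vocab=64, hidden=256, n_mels=80, n_joints=45):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        # Both heads read the same encoder states; this shared
        # representation is what a system-level pipeline lacks.
        self.speech_head = nn.Linear(hidden, n_mels)     # mel frames
        self.gesture_head = nn.Linear(hidden, n_joints)  # joint rotations

    def forward(self, tokens):
        states, _ = self.encoder(self.embed(tokens))
        return self.speech_head(states), self.gesture_head(states)

tokens = torch.randint(0, 64, (1, 120))  # dummy phoneme IDs
mels, poses = ISGModel()(tokens)
print(mels.shape, poses.shape)  # (1, 120, 80) and (1, 120, 45)
```

Because both outputs derive from the same hidden states, timing and prosody information is shared between the modalities, which is the inefficiency-and-inconsistency argument the abstract makes against stacking two separate systems.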

Cited by 11 publications (3 citation statements: 0 supporting, 3 mentioning, 0 contrasting)
References: 51 publications

“…Another line of work has considered training verbal (text‐to‐speech) and non‐verbal (speech‐to‐gesture) synthesis systems on the same data [ASH*20] and, subsequently, merging them into one single network that generates both speech audio and gesture motion [WAG*21]. Given the strides that have been made in generating convincing speech audio from text [TQSL21], adapting successful text‐to‐speech methods to simultaneously generate both acoustics and joint rotations, as was done in [WAG*21], seems like a compelling direction for future work. This not only brings advantages in terms of modeling efficiency (the gesture‐generation systems will possess information about, e.g.…”
Section: Key Challenges of Gesture Generation (citation type: mentioning)
confidence: 99%
“…Yoon et al [2019] build a GRU-based model for gesture generation, where the model is trained on the TED dataset. Wang et al [2021a] propose to improve the motion quality by jointly synthesizing speech and gestures from the text in an integrated LSTM architecture. Liu et al [2022c] propose a cascaded LSTM and MLP by integrating emotion, speaker identity, and style features for motion synthesis.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
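The "integrated LSTM architecture" named in the statement above can be illustrated with one hypothetical autoregressive decoding step in which the previous acoustic frame and pose frame are fed back together, so each modality conditions the other. This is a sketch under assumed dimensions, not Wang et al.'s published model.

```python
# Hypothetical joint autoregressive step: the previous mel frame and pose
# frame are concatenated and fed back into one LSTM cell, so speech and
# gesture are generated in lockstep. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

n_mels, n_joints, hidden = 80, 45, 256
cell = nn.LSTMCell(n_mels + n_joints, hidden)
speech_out = nn.Linear(hidden, n_mels)
pose_out = nn.Linear(hidden, n_joints)

prev_mel = torch.zeros(1, n_mels)     # <GO> acoustic frame
prev_pose = torch.zeros(1, n_joints)  # <GO> pose frame
h = torch.zeros(1, hidden)
c = torch.zeros(1, hidden)
frames = []
for _ in range(100):  # decode 100 synchronized speech+gesture frames
    h, c = cell(torch.cat([prev_mel, prev_pose], dim=-1), (h, c))
    prev_mel, prev_pose = speech_out(h), pose_out(h)
    frames.append((prev_mel, prev_pose))
```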
“…Gesture generation is a complex task that requires understanding speech, gestures, and their relationships. The present data-driven studies mainly consider four modalities: text [6,80,89], audio [20,24,62], gesture motion [52,85,88], and speaker identity [3,4,50]. There are some works to extend the scale of the dataset.…”
Section: Gesture Generation (citation type: mentioning)
confidence: 99%