Proceedings of the 19th ACM International Conference on Intelligent Virtual Agents 2019
DOI: 10.1145/3308532.3329472

Analyzing Input and Output Representations for Speech-Driven Gesture Generation

Abstract: This paper presents a novel framework for automatic speech-driven gesture generation, applicable to human-agent interaction including both virtual agents and robots. Specifically, we extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning. Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates. Our approach consists of two steps. First, we learn a lower-dimensional representation of …
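A minimal sketch of the two-step pipeline summarized in the abstract is given below, purely for illustration: an autoencoder compresses 3D poses into a lower-dimensional representation, and a separate speech encoder is trained to predict that representation, which is decoded back into poses at synthesis time. The class names, layer types, feature choices, and dimensionalities here are assumptions and do not reproduce the networks reported in the paper.

import torch
import torch.nn as nn

class MotionAutoencoder(nn.Module):
    # Step 1: learn a lower-dimensional representation of human motion.
    # pose_dim and repr_dim are placeholder values, not the paper's settings.
    def __init__(self, pose_dim=192, repr_dim=45):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(pose_dim, 128), nn.ReLU(),
                                     nn.Linear(128, repr_dim))
        self.decoder = nn.Sequential(nn.Linear(repr_dim, 128), nn.ReLU(),
                                     nn.Linear(128, pose_dim))

    def forward(self, pose):
        z = self.encoder(pose)
        return self.decoder(z), z

class SpeechEncoder(nn.Module):
    # Step 2: map frame-level speech features to the learned motion representation.
    def __init__(self, speech_dim=26, repr_dim=45):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(speech_dim, 256), nn.ReLU(),
                                 nn.Linear(256, repr_dim))

    def forward(self, speech_features):
        return self.net(speech_features)

# Synthesis: speech features -> motion representation -> 3D joint coordinates.
autoencoder = MotionAutoencoder()
speech_encoder = SpeechEncoder()
speech_features = torch.randn(8, 26)                  # a batch of 8 speech frames
predicted_repr = speech_encoder(speech_features)      # speech -> representation
predicted_pose = autoencoder.decoder(predicted_repr)  # representation -> pose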

Citations: Cited by 131 publications (108 citation statements)
References: 39 publications
“…Audio-Driven Gesture Generation. Most prior work on data-driven gesture generation has used the audio signal as the only speech-input modality in the model [14,15,19,28,42]. For example, Sadoughi and Busso [42] trained a probabilistic graphical model to generate a discrete set of gestures based on the speech audio signal, using discourse functions as constraints.…”
Section: 2.1 (mentioning)
confidence: 99%
“…Hasegawa et al [19] developed a more general model capable of generating arbitrary 3D motion using a deep recurrent neural network, applying smoothing as a post-processing step. Kucherenko et al [28] extended this work by applying representation learning to the human pose and reducing the need for smoothing. Recently, Ginosar et al [15] applied a convolutional neural network with adversarial training to generate 2D poses from spectrogram features.…”
Section: 2.1 (mentioning)
confidence: 99%
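As a side note on the post-processing mentioned in this excerpt, temporally smoothing a generated pose sequence can be sketched as below. The moving-average filter and window size are arbitrary choices for illustration; the cited work applied its own smoothing scheme.

import numpy as np

def smooth_poses(poses, window=5):
    # Centered moving average over time for a (frames x coordinates) pose array.
    # Edge padding keeps the output the same length as the input.
    kernel = np.ones(window) / window
    padded = np.pad(poses, ((window // 2, window // 2), (0, 0)), mode="edge")
    return np.stack([np.convolve(padded[:, d], kernel, mode="valid")
                     for d in range(poses.shape[1])], axis=1)

raw_poses = np.random.randn(100, 192)   # e.g. 100 frames of 64 joints x 3 coordinates
smoothed = smooth_poses(raw_poses)
assert smoothed.shape == raw_poses.shape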
“…Hasegawa et al [34] discussed a data-driven model for metaphoric gesture motion synthesis for a stick figure based on a speech input in Japanese; however, the generated gestures were rated relatively lower than the original gestures in semantic consistency. This model was further improved through motion representation learning to ameliorate gesture motion synthesis [49], but using the same language. Yoon et al [91] introduced a data-driven end-to-end robot model for generating different categories of gestures (including iconic and metaphoric gestures) based on input text rather than direct speech, which is similar to the rule-based gesture generators explained earlier.…”
Section: Related Work (mentioning)
confidence: 99%
“…A recurrent neural network trained on person-specific gesture-speech sequences (motion and audio data from talk shows) was able to produce novel speech-synchronous gestures based on novel speech from the person the neural network was trained on (Ginosar et al, 2019). These neural networks thus show that there must be some person-specific invariant between speech acoustics and gesture motion, although it remains unknown what the neural network in fact picked up on in speech so as to produce gesture so well (but see Kucherenko, Hasegawa, Henter, Kaneko, & Kjellström, 2019).…”
mentioning
confidence: 99%