2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021
DOI: 10.1109/iccv48922.2021.01110
Audio2Gestures: Generating Diverse Gestures from Speech Audio with Conditional Variational Autoencoders

Cited by 78 publications (26 citation statements)
References 23 publications
“…Probabilistic frameworks were also used in recent gesture generation works to handle gesture ambiguity [1], [2], [25], [20], [6]. This type of method learns a latent-space generative model and can generate multiple gesture motions from the same speech input by conditionally sampling from the latent space during inference.…”
Section: Related Work, A. Co-speech Gesture Generation
confidence: 99%
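The conditional-sampling idea in the statement above can be illustrated with a minimal sketch: a trained CVAE decoder maps a latent sample plus an audio condition to a pose, so drawing different latent codes from the prior for the same audio yields different gestures. All names, shapes, and weights below are hypothetical stand-ins, not the paper's actual model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical decoder weights, standing in for a trained CVAE decoder.
LATENT_DIM, AUDIO_DIM, POSE_DIM = 8, 16, 12
W_z = rng.standard_normal((LATENT_DIM, POSE_DIM)) * 0.1
W_a = rng.standard_normal((AUDIO_DIM, POSE_DIM)) * 0.1

def decode(z, audio_feat):
    """Map a latent sample plus an audio condition to one pose vector."""
    return np.tanh(z @ W_z + audio_feat @ W_a)

audio_feat = rng.standard_normal(AUDIO_DIM)  # one fixed speech input

# Different z drawn from the prior N(0, I) give different gestures for
# the *same* audio: the one-to-many mapping the statement describes.
poses = [decode(rng.standard_normal(LATENT_DIM), audio_feat)
         for _ in range(3)]
print(all(p.shape == (POSE_DIM,) for p in poses))  # consistent pose shape
print(not np.allclose(poses[0], poses[1]))         # samples differ
```

In a real CVAE the decoder is a deep network and the condition is a learned audio embedding, but the inference-time recipe is the same: fix the condition, resample the latent code.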
“…Adding adversarial training schemes [13], [31] improved the results, but at the cost of a more elaborate training process that requires careful tuning. More recent works explored probabilistic frameworks [2], [20] to address the non-deterministic nature of gesture synthesis and to sample new gestures during inference. This line of work builds a latent-space model (e.g., normalizing flows) over gesture motions to learn its conditional probability distributions.…”
Section: Introduction
confidence: 99%
“…Traum et al. [39] used LSTMs to synthesise head movements from speech. Despite the use of RNNs in previous models, Li et al. [21] showed that convolutional layers are better suited to the motion generation task, as they prevent the error accumulation that is specific to RNNs.…”
Section: Data-driven Approaches
confidence: 99%
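The error-accumulation claim above can be made concrete with a toy sketch (not the cited papers' models): an autoregressive, RNN-style rollout feeds each prediction back as input, so a small per-step bias compounds over the sequence, whereas a model that predicts every frame directly from the input keeps the same bias bounded.

```python
import numpy as np

T = 100
true_seq = np.sin(np.linspace(0.0, 6.0, T))  # toy target motion curve
eps = 0.02                                   # small per-step model bias

# Autoregressive rollout: each step feeds back its own prediction,
# so the per-step bias compounds over the sequence.
ar_pred = np.empty(T)
ar_pred[0] = true_seq[0]
for t in range(1, T):
    step = (true_seq[t] - true_seq[t - 1]) + eps  # biased step estimate
    ar_pred[t] = ar_pred[t - 1] + step

# Direct (convolution-style) prediction: each frame is computed from the
# input alone, so the bias stays constant instead of accumulating.
conv_pred = true_seq + eps

ar_err = np.abs(ar_pred - true_seq).max()      # grows like eps * T
conv_err = np.abs(conv_pred - true_seq).max()  # stays at eps
print(ar_err > 10 * conv_err)
```

The drift in the autoregressive rollout grows roughly linearly with sequence length, which is the failure mode the quoted statements attribute to RNN-based motion generators.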
“…Indeed, we did not use LSTMs but 1D convolution layers. This choice is motivated by the fact that Li et al. [21] have shown that convolutional layers are better suited to the motion generation task, as they prevent the error accumulation that is specific to RNNs. Moreover, our outputs differ from those of Wu et al. [42], who use only the 3D coordinates of the upper-body movements.…”
Section: Data Normalisation
confidence: 99%
“…Recently, a new paradigm demonstrating the use of multi-modal data has emerged. It has shown tremendous success for generating poses from audio data in dance sequences [18,21,32], for speech gesture generation [11,20], and for learning new multi-modal features from visual and sound data that can be used in a variety of classic vision or audio tasks [2].…”
Section: Pose Estimation With Audio
confidence: 99%