2019 IEEE International Conference on Multimedia & Expo Workshops (ICMEW)
DOI: 10.1109/icmew.2019.00069

Audio2Face: Generating Speech/Face Animation from Single Audio with Attention-Based Bidirectional LSTM Networks

Abstract: We propose an end-to-end deep learning approach for generating real-time facial animation from audio alone. Specifically, our deep architecture employs a deep bidirectional long short-term memory (LSTM) network and an attention mechanism to discover latent representations of time-varying contextual information within the speech and to recognize how much different pieces of information contribute to a given facial state. Therefore, our model is able to drive different levels of facial movement at inference and automatical…
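
As a rough illustration of the architecture described in the abstract, the following is a minimal sketch (in PyTorch) of an attention-based bidirectional LSTM that regresses facial-animation parameters from a window of audio features. The feature dimensions, layer sizes, and output parameterization (blendshape-style weights) are illustrative assumptions and are not taken from the paper.

# Minimal sketch, assuming MFCC-like audio features and blendshape-style outputs
# (dimensions below are illustrative, not taken from the paper).
import torch
import torch.nn as nn


class AudioToFace(nn.Module):
    def __init__(self, audio_dim=39, hidden_dim=128, face_dim=51):
        super().__init__()
        # Bidirectional LSTM captures past and future context within the audio window.
        self.blstm = nn.LSTM(audio_dim, hidden_dim, num_layers=2,
                             batch_first=True, bidirectional=True)
        # Additive attention scores how much each time step contributes to the face state.
        self.attn_score = nn.Linear(2 * hidden_dim, 1)
        # Regressor from the attention-pooled context to facial parameters.
        self.out = nn.Linear(2 * hidden_dim, face_dim)

    def forward(self, audio):                               # audio: (batch, time, audio_dim)
        h, _ = self.blstm(audio)                            # (batch, time, 2 * hidden_dim)
        weights = torch.softmax(self.attn_score(h), dim=1)  # (batch, time, 1)
        context = (weights * h).sum(dim=1)                  # attention-weighted sum over time
        return self.out(context)                            # (batch, face_dim)


# Example: a batch of 2 audio windows, 25 frames each, 39-dim features per frame.
model = AudioToFace()
face_params = model(torch.randn(2, 25, 39))                 # shape: (2, 51)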

Cited by 40 publications (24 citation statements)
References 13 publications

“…Technologies such as NVIDIA's Audio2Face have recently been demonstrated to produce audio-driven, AI-based facial characters [18]. This allows a synthesized voice to drive a realistic facial animation system or a stylised Animoji system in real time.…”
Section: Digital Assistants (Type 4)
Mentioning confidence: 99%
“…In other words, it is not controlled by a real video, but by an external device. This is the most experimental form of face control as purely voice-based solutions are only just emerging from research labs [18]. This form of face control will be necessary for the creation of virtual assistants as shown in type 4 in section 3.…”
Section: How the Face Is Controlled (Animation)
Mentioning confidence: 99%
“…Recent advances in deep learning first boosted lip-sync research at the audio-to-mouth stage. RNN-based architectures [3,6] facilitate the learning of the sequential mapping from audio signals to mouth movements. In the rendering stage, ObamaNet [2] is a representative work that demonstrates the power of neural networks [7] in synthesizing photorealistic appearance.…”
Section: Introduction
Mentioning confidence: 99%
“…In the rendering stage, ObamaNet [2] is a representative work that demonstrates the power of neural networks [7] in synthesizing photorealistic appearance. These methods [8,6] provide a fully trainable solution to the classical two-stage lip-sync scheme, significantly improving both lip-sync accuracy and processing efficiency.…”
Section: Introduction
Mentioning confidence: 99%
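
As a concrete reading of the two-stage lip-sync scheme mentioned in these statements, here is a minimal, hedged skeleton: stage one maps audio frames to mouth parameters with a recurrent network, and stage two renders an appearance from those parameters. The module names, dimensions, and toy rendering head are assumptions for illustration only, not the architectures of the cited works.

# Illustrative skeleton of a two-stage lip-sync pipeline; names and dimensions are
# assumptions, and the rendering head is a stand-in for a real neural renderer.
import torch
import torch.nn as nn


class AudioToMouth(nn.Module):
    """Stage 1: sequential mapping from audio frames to mouth landmark parameters."""
    def __init__(self, audio_dim=39, hidden_dim=64, landmark_dim=40):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, landmark_dim)

    def forward(self, audio):                  # (batch, time, audio_dim)
        h, _ = self.rnn(audio)
        return self.proj(h)                    # (batch, time, landmark_dim)


class MouthToFrame(nn.Module):
    """Stage 2: render pixels from landmarks (toy stand-in for a neural renderer)."""
    def __init__(self, landmark_dim=40, frame_pixels=64 * 64):
        super().__init__()
        self.render = nn.Sequential(nn.Linear(landmark_dim, 256), nn.ReLU(),
                                    nn.Linear(256, frame_pixels))

    def forward(self, landmarks):              # (batch, time, landmark_dim)
        return self.render(landmarks)          # (batch, time, frame_pixels)


# The two stages chain into a single differentiable (fully trainable) pipeline.
audio = torch.randn(1, 50, 39)
frames = MouthToFrame()(AudioToMouth()(audio))  # shape: (1, 50, 4096)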