2021
DOI: 10.1007/s11704-020-0133-7
Speech-driven facial animation with spectral gathering and temporal attention

Cited by 13 publications (6 citation statements) · References 48 publications

“…Closer to our approach, a number of works produce 3D animations directly from speech [KAL*17; PWP18; TPL*20; RZW*21; CWWZ22]. Using formants as sound representation, Karras et al. [KAL*17] achieve impressive results from less than 4 minutes of training data.…”
Section: Related Work (mentioning, confidence: 99%)
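As a side note on the formant-based sound representation mentioned in the statement above, the following is a minimal sketch of one common way to estimate formants, via LPC analysis of a short audio frame. It is not taken from [KAL*17]; the file name, frame length and LPC order are illustrative assumptions.

```python
# Minimal LPC-based formant estimation sketch (illustrative assumptions only).
import numpy as np
import librosa

def formants(frame, sr, order=12, n_formants=3):
    """Estimate the first few formant frequencies (Hz) of one audio frame."""
    a = librosa.lpc(frame * np.hamming(len(frame)), order=order)
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]          # keep one root per conjugate pair
    freqs = np.sort(np.angle(roots) * sr / (2 * np.pi))
    return freqs[:n_formants]                  # real trackers also filter by bandwidth

y, sr = librosa.load("speech.wav", sr=16000)   # hypothetical input file
frame = y[:int(0.025 * sr)]                    # a 25 ms analysis window
print(formants(frame, sr))                     # e.g. three formant estimates in Hz
```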
“…More recently, Chai et al. [CWWZ22] gather information along the frequency dimension of a speech window with a stack of convolutions, but use self-attention layers to collect information along the time dimension. Similar to Cudeiro et al. [CBL*19], their model takes speaker identity as auxiliary input and is thus able to explicitly model different speaking styles.…”
Section: Related Work (mentioning, confidence: 99%)
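For readers trying to picture the "convolutions along frequency, self-attention along time, speaker identity as auxiliary input" pipeline described in the statement above, here is a minimal PyTorch sketch. It is not the authors' implementation: the layer sizes, the mel-spectrogram input, the 5023-vertex output (a VOCA/FLAME-style mesh) and the speaker-embedding interface are all assumptions made for illustration.

```python
# Minimal sketch: gather spectral info per frame with convolutions,
# then attend over time, conditioned on a speaker-identity embedding.
import torch
import torch.nn as nn

class SpectralGatherTemporalAttention(nn.Module):
    def __init__(self, n_mels=80, d_model=128, n_speakers=8, out_dim=5023 * 3):
        super().__init__()
        # Stack of convolutions along the frequency axis of each frame,
        # progressively "gathering" spectral information into one feature vector.
        self.freq_conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),              # collapse remaining frequency bins
        )
        self.proj = nn.Linear(64, d_model)
        # Self-attention collects information along the time dimension.
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Speaker identity as an auxiliary input (learned embedding, assumed interface).
        self.speaker_emb = nn.Embedding(n_speakers, d_model)
        self.head = nn.Linear(2 * d_model, out_dim)   # e.g. per-vertex offsets

    def forward(self, mel, speaker_id):
        # mel: (B, T, n_mels), speaker_id: (B,)
        B, T, F = mel.shape
        x = mel.reshape(B * T, 1, F)                  # one spectrum per frame
        x = self.freq_conv(x).squeeze(-1)             # (B*T, 64)
        x = self.proj(x).reshape(B, T, -1)            # (B, T, d_model)
        x = self.temporal(x)                          # attention over time
        spk = self.speaker_emb(speaker_id)            # (B, d_model)
        spk = spk.unsqueeze(1).expand(-1, T, -1)      # broadcast over frames
        return self.head(torch.cat([x, spk], dim=-1)) # (B, T, out_dim)

# Usage: 8 frames of an 80-bin mel-spectrogram for two speakers.
model = SpectralGatherTemporalAttention()
offsets = model(torch.randn(2, 8, 80), torch.tensor([0, 1]))
print(offsets.shape)  # torch.Size([2, 8, 15069])
```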
“…There are several methods [10][11][12] for obtaining 3D facial parameter representations from 2D monocular videos, but the quality of the synthesized 3D data is limited by the accuracy of 3D reconstruction techniques, which cannot recover subtle 3D changes from 2D video, so the results may be unreliable. Works that generate 3D facial animation directly on 3D meshes [13][14][15] restrict the speech input to short audio windows, which can cause pauses in lip movements as the speech changes and thus reduce the realism of the facial motion.…”
Section: Introduction (mentioning, confidence: 99%)
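To make the "short audio windows" limitation concrete, this is a small sketch of the per-frame windowing scheme such mesh-based methods typically use: each animation frame only sees a fixed-length slice of audio centred on it. The sampling rate, frame rate and window length are assumptions for illustration, not values from [13][14][15].

```python
# Minimal sketch of per-frame audio windowing (assumed parameters).
import numpy as np

def audio_windows(audio, sr=16000, fps=30, win_sec=0.52):
    half = int(win_sec * sr / 2)
    n_frames = int(len(audio) / sr * fps)
    windows = []
    for i in range(n_frames):
        c = int(i / fps * sr)                     # sample index of animation frame i
        w = audio[max(0, c - half): c + half]     # short window centred on the frame
        w = np.pad(w, (0, 2 * half - len(w)))     # pad edge frames to a fixed size
        windows.append(w)
    return np.stack(windows)                      # (n_frames, window_samples)
```

Because each window is processed more or less independently, abrupt differences between consecutive windows can surface as the pauses in lip motion the statement refers to.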