2023
DOI: 10.48550/arxiv.2301.13430
Preprint

GeneFace: Generalized and High-Fidelity Audio-Driven 3D Talking Face Synthesis

Cited by 5 publications
(5 citation statements)
“…To tackle the above issues, there are methods to decouple the NeRF-based talking head generation process. Geneface [10] is the first method that attempts to achieve this process by facial landmarks. It utilizes variational auto-encoder (VAE) [31] to generate facial landmarks from audio, and then employs additional networks to refine these landmarks.…”
Section: B 3D-based Methods (mentioning)
confidence: 99%
“…This insight sparks the idea of decoupling the NeRF-based talking head generation process through the utilization of facial landmarks. Actually, a few methods like [10], [11] have validated the potential of decoupling talking head generation via landmark-based neural radiation fields. However, they still have some limitations, such as the inability to generate landmarks that align with the training set distribution in a single attempt and the lack of precise control over the contribution of landmarks at each sampling point, which is also a common challenge faced by NeRF-based methods and leads to increased training time.…”
Section: Introduction (mentioning)
confidence: 99%
“…HuBERT Encoder To make full use of the information contained in the audio, we adopt a pre-trained HuBERT model to extract features. Instead of directly taking the final embedding as the subsequent input [36], we predict N hidden layers, which are weighted summed to feed into the Mon Decoder. The obtained HuBERT feature f h can be represented as:…”
Section: Architecture (mentioning)
confidence: 99%
“…RAD-NeRF [12] decompose the inherently high-dimensional talking portrait representation into three low-dimensional feature grids, that makes can rending the talking portrait in real-time. GeneFace [14] propose a variational motion generator to generate accurate and expressive facial landmark and uses a NeRF-based renderer to render high-fidelity frames. Due to a lack of prior information, these tasks still struggle to render realistic expressions and natural movements.…”
Section: A Audio-driven Talking Face Generation (mentioning)
confidence: 99%