2022
DOI: 10.48550/arxiv.2203.04036
Preprint
StyleHEAT: One-Shot High-Resolution Editable Talking Face Generation via Pre-trained StyleGAN

Cited by 6 publications (13 citation statements) · References 60 publications
“…The model presented in [45] modularizes audio-visual representations by devising an implicit low-dimensional pose code to tackle the problem of rhythmic head motion. In StyleHEAT [44], the authors show how to utilize the StyleGAN [17] model to create talking faces guided by speech embeddings and additionally controlled by intuitive or attribute editing. Some modern approaches utilize rendering networks to obtain more accurate 3D face representations.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
“…Generated heads move and behave in a natural, expressive way while still preserving the subject's identity and plausible lip sync. In contrast to most recent approaches [3, 15, 24, 29, 35, 40, 44-46], we use Denoising Diffusion Probabilistic Models [12, 19], which take a variational approach instead of adversarial training and do not require stabilizing multiple discriminators. To eliminate the problem of unnatural-looking sequences, we introduce motion frames (see Section 4.2) that recurrently guide video creation.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
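As a point of reference for the variational objective contrasted with adversarial training in the excerpt above, the sketch below shows the simplified DDPM training loss (predicting the noise injected at a random timestep) on toy vectors. This is illustrative only: `eps_model` is a hypothetical stand-in denoiser, and the cited talking-head work conditions a far larger network on audio and motion frames, which is not reproduced here.

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative signal rate

# Hypothetical stand-in denoiser on 16-d toy vectors (+1 input for the timestep).
eps_model = nn.Sequential(nn.Linear(16 + 1, 64), nn.ReLU(), nn.Linear(64, 16))

def ddpm_loss(x0: torch.Tensor) -> torch.Tensor:
    """Simplified DDPM objective: predict the noise added at a random step t."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))
    eps = torch.randn_like(x0)
    a = alphas_bar[t].unsqueeze(-1)
    x_t = a.sqrt() * x0 + (1.0 - a).sqrt() * eps   # forward process q(x_t | x_0)
    t_feat = (t.float() / T).unsqueeze(-1)         # crude timestep embedding
    eps_hat = eps_model(torch.cat([x_t, t_feat], dim=-1))
    return nn.functional.mse_loss(eps_hat, eps)    # no discriminator involved

x0 = torch.randn(8, 16)                            # toy "clean" data batch
print(ddpm_loss(x0).item())
```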
“…The former maps randomly sampled noise to a 512-dimensional style latent code, while the latter synthesizes images from this latent code and a constant input via Adaptive Instance Normalization layers. To handle conditional synthesis tasks, recent methods [5, 8, 9, 10, 11, 12] use a technique called GAN inversion [13]: an image is mapped into the latent space of a pretrained GAN model to find a latent code from which the image can be faithfully reconstructed.…”
Section: Preliminaries (citation type: mentioning)
confidence: 99%
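The optimization-based variant of GAN inversion described in this excerpt can be summarized in a few lines. The sketch below is a minimal illustration, not any cited paper's implementation: `ToyGenerator` is a hypothetical stand-in for a pretrained, frozen StyleGAN synthesis network, and a real setup would load actual weights, operate in the W or W+ space, and typically add a perceptual (e.g. LPIPS) term to the reconstruction loss.

```python
import torch
import torch.nn as nn

LATENT_DIM = 512  # StyleGAN's style code dimensionality

class ToyGenerator(nn.Module):
    """Hypothetical stand-in for a pretrained, frozen StyleGAN synthesis network."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM, 1024), nn.ReLU(),
            nn.Linear(1024, 3 * 32 * 32), nn.Tanh(),
        )
    def forward(self, w):
        return self.net(w).view(-1, 3, 32, 32)

def invert(G, target, steps=500, lr=0.05):
    """Optimize a latent code w so that G(w) reconstructs `target`."""
    G.eval()
    for p in G.parameters():          # the generator stays frozen
        p.requires_grad_(False)
    w = torch.zeros(1, LATENT_DIM, requires_grad=True)  # init near the mean code
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = nn.functional.mse_loss(G(w), target)  # pixel reconstruction loss
        loss.backward()
        opt.step()
    return w.detach()

G = ToyGenerator()
target = torch.rand(1, 3, 32, 32) * 2 - 1  # placeholder "real" image
w_hat = invert(G, target)
print("reconstruction error:", nn.functional.mse_loss(G(w_hat), target).item())
```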
“…Face editing via GAN (Generative Adversarial Network) inversion [1] enables users to flexibly edit a wide range of facial attributes in real face images. Existing methods [2, 3, 4, 5] first invert face images into the latent space of 2D GANs such as StyleGAN [6], then manipulate the style codes, and finally feed the edited codes into the pre-trained generator to obtain the edited face images. However, 2D GANs lack knowledge of the underlying 3D structure of faces, and their 3D consistency in multi-view generation is limited, as shown in Fig.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
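The invert-then-edit pipeline described in this excerpt reduces to three steps: invert the image, shift its style code along a semantic direction, and decode the edited code with the frozen generator. The sketch below reuses `ToyGenerator` and `invert` from the inversion sketch above; the edit `direction` is a hypothetical random placeholder (real directions come from methods such as InterFaceGAN), so this illustrates the pipeline shape, not any cited system.

```python
import torch

# Reuses ToyGenerator and invert from the GAN inversion sketch above.
G = ToyGenerator()
image = torch.rand(1, 3, 32, 32) * 2 - 1  # placeholder input face image
w = invert(G, image)                       # 1) invert: image -> style code
direction = torch.randn(1, 512)            # stand-in semantic direction
direction = direction / direction.norm()   # unit-length edit direction
w_edited = w + 2.0 * direction             # 2) manipulate the style code
edited = G(w_edited)                       # 3) decode with the frozen generator
```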