2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.00802

Hierarchical Cross-Modal Talking Face Generation With Dynamic Pixel-Wise Loss

Abstract: We devise a cascade GAN approach to generate talking face video, which is robust to different face shapes, view angles, facial characteristics, and noisy audio conditions. Instead of learning a direct mapping from audio to video frames, we propose first to transfer audio to high-level structure, i.e., the facial landmarks, and then to generate video frames conditioned on the landmarks. Compared to a direct audio-to-image approach, our cascade approach avoids fitting spurious correlations between audiovisual si…
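
The abstract describes a two-stage cascade: audio is first mapped to facial landmarks, and a second generator then produces frames conditioned on those landmarks. A minimal PyTorch sketch of that cascade structure is below; the layer sizes, feature dimensions, and module names are illustrative assumptions, not the paper's actual architecture (which also uses attention and a dynamic pixel-wise loss).

```python
# Minimal sketch of the two-stage (cascade) idea described in the abstract:
# stage 1 maps audio features to facial landmarks, stage 2 generates a frame
# conditioned on those landmarks. All dimensions and module names are
# illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn


class AudioToLandmarks(nn.Module):
    """Stage 1: audio feature window -> 2D facial landmark coordinates."""

    def __init__(self, audio_dim=128, n_landmarks=68):
        super().__init__()
        self.n_landmarks = n_landmarks
        self.net = nn.Sequential(
            nn.Linear(audio_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_landmarks * 2),  # (x, y) per landmark
        )

    def forward(self, audio_feat):
        return self.net(audio_feat).view(-1, self.n_landmarks, 2)


class LandmarksToFrame(nn.Module):
    """Stage 2: landmarks -> face frame (a toy fully connected decoder)."""

    def __init__(self, n_landmarks=68, img_size=64):
        super().__init__()
        self.img_size = img_size
        self.net = nn.Sequential(
            nn.Linear(n_landmarks * 2, 1024),
            nn.ReLU(),
            nn.Linear(1024, 3 * img_size * img_size),
            nn.Tanh(),
        )

    def forward(self, landmarks):
        x = self.net(landmarks.flatten(1))
        return x.view(-1, 3, self.img_size, self.img_size)


# Cascade: audio -> landmarks -> frame (per frame, ignoring temporal modeling).
audio = torch.randn(4, 128)        # a batch of audio feature windows
stage1, stage2 = AudioToLandmarks(), LandmarksToFrame()
frames = stage2(stage1(audio))     # shape: (4, 3, 64, 64)
print(frames.shape)
```

The benefit argued in the abstract is that the intermediate landmark representation keeps the image generator from fitting spurious correlations between the audio and pixel signals.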

Cited by 362 publications (337 citation statements) | References 26 publications

“…In this section, we compare ET-GAN with the state-of-the-art methods including Speech2Vid [23], DAVS [49], ATVG [8] and SDA [42] using the metrics introduced above. As shown in Table 1, ET-GAN surpasses the other state-of-the-art methods from a quantitative standpoint, which means ET-GAN has higher fidelity than the others.…”
Section: Quantitative Results
confidence: 99%
“…As shown in Figure 6, these continuous video frames come from the same identity saying the same word with different methods. Images in the first row are ground truths, and the remaining rows are the state-of-the-art methods [8,23,42,49] and ET-GAN. Faces generated by Speech2Vid and ATVG barely contain hair and necks.…”
Section: Qualitative Results
confidence: 99%
“…Most of these applications involve image processing. Although there have been some studies involving video processing, such as video generation [115], video colorization [116], [117], video inpainting [118], motion transfer [119], and facial animation synthesis [120]- [123], the research on video using GANs is limited. In addition, although GANs have been applied to the generation and synthesis of 3D models, such as 3D colorization [124], 3D face reconstruction [125], [126], 3D character animation [127], and 3D textured object generation [128], the results are far from perfect.…”
Section: B. Future Opportunities
confidence: 99%