SIGGRAPH Asia 2022 Conference Papers
DOI: 10.1145/3550469.3555399

VideoReTalking: Audio-based Lip Synchronization for Talking Head Video Editing In the Wild

Abstract: We present VideoReTalking, a new system to edit the faces of a real-world talking head video according to input audio, producing a high-quality and lip-syncing output video even with a different emotion. Our system disentangles this objective into three sequential tasks: (1) face video generation with a canonical expression; (2) audio-driven lip-sync; and (3) face enhancement for improving photo-realism. Given a talking-head video, we first modify the expression of each frame according to the same expression template…
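
For orientation, here is a minimal, hypothetical sketch of the three-stage pipeline the abstract describes. The stage functions are identity stubs standing in for the paper's learned networks (semantic-guided reenactment, audio-driven lip-sync, and identity-aware face enhancement); none of the names below come from the authors' released code.

```python
# Hypothetical sketch of the three sequential stages from the abstract.
# All names here are illustrative placeholders, not the authors' API.
from typing import Any, List

Frame = Any  # stand-in for an HxWx3 image array
Audio = Any  # stand-in for a waveform or mel-spectrogram


def reenact_canonical_expression(frame: Frame) -> Frame:
    """Stage 1 (stub): re-render the frame with a canonical
    expression template via semantic-guided reenactment."""
    return frame


def lip_sync(frames: List[Frame], audio: Audio) -> List[Frame]:
    """Stage 2 (stub): generate mouth motion on the
    canonical-expression video to match the input audio."""
    return frames


def enhance_face(frame: Frame) -> Frame:
    """Stage 3 (stub): identity-aware face enhancement to
    restore photo-realistic detail."""
    return frame


def edit_talking_head(frames: List[Frame], audio: Audio) -> List[Frame]:
    """Run the three stages sequentially, with no user intervention."""
    canonical = [reenact_canonical_expression(f) for f in frames]
    synced = lip_sync(canonical, audio)
    return [enhance_face(f) for f in synced]
```

The point this sketch tries to capture is the sequential decomposition: each stage consumes the previous stage's output, so the full edit can run as a single pipeline without user intervention.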

Cited by 47 publications (14 citation statements) · References 49 publications
“…Text-to-Video diffusion models [1,2,3,4,5,6,7,8,9,10,11] have become a powerful tool for producing high-quality video content from textual prompts. Pika [9] is a commercial text-to-video model by Pika Labs, which advances the field of video generation.…”
Section: Text-to-Video Diffusion Models
confidence: 99%
“…(2) We provide a detailed and in-depth comparison with the text-to-image prompt-gallery dataset, DiffusionDB, and highlight the necessity of VidProM as well as real users' preference. (3) We reveal several exciting research directions inspired by VidProM and position it as a rich database for future studies.…”
Section: Introduction
confidence: 98%
“…Beyond faking faces or voices related to a person's identity through audio deepfake [23,36] or visual deepfake [25,41,44] techniques within each modality, malicious attackers can combine these technologies to create multimodal forged content, where both audio and visuals can be fake. Moreover, cutting-edge deepfake methods like Wav2Lip [37] and VideoReTalking [10] can even achieve cross-modal forgery by driving audio to generate precise lip-sync videos. Consequently, it is evident that deepfakes now encompass audio, visual, and even cross-modal forms.…”
Section: Introduction
confidence: 99%