2023
DOI: 10.1609/aaai.v37i2.25280

StyleTalk: One-Shot Talking Head Generation with Controllable Speaking Styles

Abstract: Different people speak with diverse personalized speaking styles. Although existing one-shot talking head methods have made significant progress in lip sync, natural facial expressions, and stable head motions, they still cannot generate diverse speaking styles in the final talking head videos. To tackle this problem, we propose a one-shot style-controllable talking face generation framework. In a nutshell, we aim to attain a speaking style from an arbitrary reference speaking video and then drive the one-shot…

Cited by 32 publications (9 citation statements) | References 42 publications
“…Compared with audio-driven methods, video-driven methods can utilize richer information contained in the input video to generate more natural and realistic results, which can be roughly classified into 2D keypoint-based methods (Siarohin et al. 2019; Zhao and Zhang 2022), 2D GAN-based methods (Wang et al. 2022; Yin et al. 2022), and 3D-model-based networks (Lahiri et al. 2021; Hong et al. 2022; Ma et al. 2023). 2D keypoint-based methods first compute the transfer matrix via matching keypoint pairs between the source image and the driving image, then warp the source image to get the dense flow, and finally generate images with a GAN generator (Hong et al. 2022).…”
Section: Related Work
Mentioning confidence: 99%
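As a rough illustration of the 2D keypoint-based pipeline described in the statement above, the sketch below estimates a transform from matched keypoint pairs, expands it into a dense sampling field, and warps the source image. It is a minimal sketch, not code from any cited paper: the function name, the affine approximation (instead of per-keypoint local transforms), and the synthetic keypoints are all assumptions, and the GAN refinement stage that would normally follow is omitted.

```python
# Minimal sketch of keypoint-driven warping: transfer matrix from matched
# keypoints -> dense flow -> warped source image. A generator network would
# normally refine the warped result; this stops at the coarse warp.
import numpy as np
import cv2

def warp_from_keypoints(source_img, src_kps, drv_kps):
    """Warp source_img so that src_kps move toward drv_kps.

    source_img: HxWx3 uint8 image
    src_kps, drv_kps: Nx2 arrays of matched (x, y) keypoints
    """
    h, w = source_img.shape[:2]
    # Transfer matrix from matched keypoint pairs (affine approximation).
    matrix, _ = cv2.estimateAffinePartial2D(drv_kps.astype(np.float32),
                                            src_kps.astype(np.float32))
    # Dense field: for every output pixel, where to sample in the source.
    xs, ys = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    grid = np.stack([xs, ys, np.ones_like(xs)], axis=-1)      # H x W x 3
    sample = (grid @ matrix.T).astype(np.float32)             # H x W x 2
    map_x = np.ascontiguousarray(sample[..., 0])
    map_y = np.ascontiguousarray(sample[..., 1])
    return cv2.remap(source_img, map_x, map_y, interpolation=cv2.INTER_LINEAR)

if __name__ == "__main__":
    img = np.zeros((256, 256, 3), np.uint8)
    cv2.circle(img, (128, 128), 40, (0, 255, 0), -1)
    src = np.array([[100, 100], [156, 100], [128, 156]], np.float32)
    drv = src + 12.0  # pretend the driving face shifted down-right
    print(warp_from_keypoints(img, src, drv).shape)
```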
“…Recent advancements (Ji et al. 2021; Pan et al. 2023; Tan, Ji, and Pan 2023) have also explored the synthesis of emotional expressions in talking faces. Ji et al. (2021) and Sinha et al. (2021) utilize one-hot emotion labels as input to generate emotional talking faces, while others (Ji et al. 2022; Ma et al. 2023b) resort to another video as the emotion source. In contrast, our approach offers more user-friendly control by allowing users to input easy-to-use text descriptions to suggest the desired emotion style.…”
Section: Related Work
Mentioning confidence: 99%
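The one-hot label conditioning mentioned above usually amounts to embedding the label and fusing it with the per-frame audio feature before decoding facial motion. The sketch below shows this in a hedged way: the module, label set, and dimensions are illustrative assumptions, not taken from Ji et al. (2021), Sinha et al. (2021), or any other cited work.

```python
# Hypothetical one-hot emotion-label conditioning: embed the label and
# concatenate it with each audio frame feature before a small decoder.
import torch
import torch.nn as nn

EMOTIONS = ["neutral", "happy", "angry", "sad"]  # assumed label set

class EmotionConditionedDecoder(nn.Module):
    def __init__(self, audio_dim=256, emo_dim=32, out_dim=64):
        super().__init__()
        self.emo_embed = nn.Linear(len(EMOTIONS), emo_dim)
        self.decoder = nn.Sequential(
            nn.Linear(audio_dim + emo_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim),  # e.g. expression coefficients per frame
        )

    def forward(self, audio_feat, emotion_idx):
        # audio_feat: (B, T, audio_dim); emotion_idx: (B,) integer labels
        one_hot = nn.functional.one_hot(emotion_idx, len(EMOTIONS)).float()
        emo = self.emo_embed(one_hot).unsqueeze(1).expand(-1, audio_feat.size(1), -1)
        return self.decoder(torch.cat([audio_feat, emo], dim=-1))

feats = torch.randn(2, 100, 256)   # 100 audio frames for 2 clips
labels = torch.tensor([1, 3])      # "happy", "sad"
print(EmotionConditionedDecoder()(feats, labels).shape)  # (2, 100, 64)
```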
“…Then, we pass the stylized images through several state-of-the-art (SOTA) talking face generation methods to achieve the same task as our proposed method. The compared methods include MakeItTalk (Zhou et al. 2020), Wav2Lip (Prajwal et al. 2020), Audio2Head (Wang et al. 2021), PC-AVS (Zhou et al. 2021), AVCT (Wang et al. 2022a), EAMM (Ji et al. 2022), and StyleTalk (Ma et al. 2023b), of which only the latter two support talking head generation with emotion style. We assess the results using SSIM (Wang et al. 2004), FID (Heusel et al. 2017), and PSNR for image generation quality, M-LMD (Chen et al. 2019) for lip-movement accuracy, and F-LMD (Chen et al. 2019) for emotion-style evaluation.…”
Section: Experiments, Experimental Settings
Mentioning confidence: 99%
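For readers unfamiliar with the per-frame metrics listed above, the sketch below computes SSIM and PSNR with scikit-image and a generic landmark distance; restricting the landmark set to the mouth gives M-LMD, while using all facial landmarks gives F-LMD. FID requires an Inception feature extractor and is omitted. The helper names and the 68-point landmark layout are assumptions, not the cited papers' code.

```python
# Rough per-frame evaluation metrics: SSIM/PSNR on aligned frames and a
# mean L2 landmark distance on precomputed facial landmarks.
import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def frame_quality(gen, gt):
    """SSIM and PSNR between a generated frame and its ground truth (HxWx3 uint8)."""
    ssim = structural_similarity(gen, gt, channel_axis=-1)
    psnr = peak_signal_noise_ratio(gt, gen)
    return ssim, psnr

def lmd(pred_lms, gt_lms):
    """Mean L2 error over matched (x, y) landmarks.
    Mouth landmarks only -> M-LMD; all facial landmarks -> F-LMD."""
    return float(np.mean(np.linalg.norm(pred_lms - gt_lms, axis=-1)))

gen = np.random.randint(0, 256, (256, 256, 3), np.uint8)
gt = np.random.randint(0, 256, (256, 256, 3), np.uint8)
print(frame_quality(gen, gt), lmd(np.random.rand(68, 2), np.random.rand(68, 2)))
```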
“…3D Morphable Models (3DMMs) are another effective intermediate representation. AudioDVP [43] predicts facial expression parameters of a 3DMM from audio and then re-renders the reenacted face after replacing the expression parameters computed from the original target video with the predicted ones; StyleTalk [44] obtains stylized expression parameters from audio and a reference style video, then generates the output video using them together with the identity reference image. In addition to explicit motion representations, implicit features have also been studied extensively.…”
Section: Talking Head Generation
Mentioning confidence: 99%
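The expression-parameter replacement idea in the statement above can be summarized in a few lines: estimate per-frame 3DMM coefficients from the target video, swap the expression block for coefficients predicted from audio (optionally conditioned on a style reference), and re-render. The sketch below is illustrative only; the coefficient layout, dimensions, and the stand-in renderer are assumptions, not the authors' implementation.

```python
# Toy sketch of 3DMM-based reenactment: keep identity/pose coefficients from
# the target video, replace expression coefficients with predicted ones, and
# hand the edited coefficients to a renderer (or image-to-image generator).
import numpy as np

def swap_expression(frame_coeffs, predicted_exp):
    """frame_coeffs: dict with 'identity', 'expression', 'pose' arrays.
    predicted_exp: expression coefficients predicted from audio (and style)."""
    edited = dict(frame_coeffs)
    edited["expression"] = predicted_exp  # identity and pose stay untouched
    return edited

def reenact(video_coeffs, predicted_exps, render_fn):
    """Re-render every frame with predicted expressions; render_fn is a
    stand-in for a differentiable renderer or a neural generator."""
    return [render_fn(swap_expression(c, e))
            for c, e in zip(video_coeffs, predicted_exps)]

# Toy usage with random coefficients and a dummy renderer.
coeffs = [{"identity": np.zeros(80), "expression": np.zeros(64), "pose": np.zeros(6)}
          for _ in range(3)]
preds = [np.random.randn(64) * 0.1 for _ in range(3)]
print(reenact(coeffs, preds, render_fn=lambda c: float(c["expression"].sum())))
```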