EmoTalk: Speech-driven emotional disentanglement for 3D face animation
2023 · Preprint
DOI: 10.48550/arxiv.2303.11089

Abstract: Speech-driven 3D face animation aims to generate realistic facial expressions that match the speech content and emotion. However, existing methods often neglect emotional facial expressions or fail to disentangle them from speech content. To address this issue, this paper proposes an end-to-end neural network to disentangle different emotions in speech so as to generate rich 3D facial expressions. Specifically, we introduce the emotion disentangling encoder (EDE) to disentangle the emotion and content in the s…
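The abstract describes an encoder that separates emotion from content in the speech signal. The sketch below is a hypothetical illustration of that idea, not the authors' EmoTalk implementation: the two-branch encoder, the feature dimensions, and the cross-reconstruction loss are assumptions drawn from common disentanglement practice.

    # Hypothetical sketch of an emotion/content disentangling encoder for speech
    # features. This is NOT the authors' EmoTalk code; the branch design,
    # dimensions, and training loss are assumptions.
    import torch
    import torch.nn as nn

    class DisentanglingEncoder(nn.Module):
        """Splits a speech feature sequence into a content code and an emotion code."""
        def __init__(self, feat_dim=768, content_dim=256, emotion_dim=64):
            super().__init__()
            # Two separate branches route information into either the content
            # space (what is said) or the emotion space (how it is said).
            self.content_branch = nn.Sequential(
                nn.Linear(feat_dim, content_dim), nn.ReLU(),
                nn.Linear(content_dim, content_dim),
            )
            self.emotion_branch = nn.Sequential(
                nn.Linear(feat_dim, emotion_dim), nn.ReLU(),
                nn.Linear(emotion_dim, emotion_dim),
            )

        def forward(self, speech_feats):
            # speech_feats: (batch, time, feat_dim) frame-level audio features
            content = self.content_branch(speech_feats)              # per-frame content code
            emotion = self.emotion_branch(speech_feats.mean(dim=1))  # utterance-level emotion code
            return content, emotion

    def cross_reconstruction_loss(decoder, encoder, feats_a, feats_b, target_a_with_emotion_b):
        # One common way to encourage disentanglement: pair the content of
        # utterance A with the emotion of utterance B and ask a decoder to
        # reproduce the corresponding facial-animation target.
        content_a, _ = encoder(feats_a)
        _, emotion_b = encoder(feats_b)
        pred = decoder(content_a, emotion_b)
        return nn.functional.mse_loss(pred, target_a_with_emotion_b)

The exact feature extractors and losses in EmoTalk differ; the point of the sketch is only that content and emotion are encoded separately, so the emotion code can be swapped or controlled independently of the spoken content.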

Cited by 5 publications (6 citation statements) · References 46 publications
“…For example, EAMM (Ji et al 2022) aims at generating one-shot emotional talking faces on arbitrary subjects, and it extracts emotion patterns from the source video. EmoTalk (Peng et al 2023) is a speech-driven 3D face animation method, while our approach can be applied in both video-driven and audio-driven settings. GC-AVT…”
Section: Emotion Editing In Talking Head Videos
Mentioning confidence: 99%
“…When interacting with virtual characters, real-time generation is critical to providing a realistic and immersive user experience. This allows for direct, natural communication and presence, as achieved by [59, 61, 101, 102] (Scenario 1). The computational complexity required to rapidly process and render realistic audio-visual input makes it difficult to achieve real-time talking head production [58, 62, 103] (Scenario 4).…”
Section: Representation Of Realism
Mentioning confidence: 99%
“…Similarly, Xing et al [110] introduce the innovative CodeTalker method, which aims to generate realistic facial animations from speech signals, enhancing the realism of virtual characters. Further, Peng et al [102] treat emotional expressions as a means of producing animations with a heightened sense of realism. In particular, Haque and Yumak [62] combine speech-driven facial expressions with enhanced realism, offering users a more convincing and authentic experience.…”
Section: Covered Criteria For Talking Head Implementation
Mentioning confidence: 99%
“…To our knowledge, [Karras et al 2017; Peng et al 2023] addressed emotional expressiveness for the audio-driven 3D facial animation synthesis task. Our goal is to explore and study emotional expressiveness in speech-driven 3D facial animation synthesis in more detail and to answer the research questions introduced in the previous section by proposing novel approaches for the synthesis task.…”
Section: Background and Related Work
Mentioning confidence: 99%
“…However, vision-based 4D reconstruction models such as DECA [Feng et al 2021] and EMOCA [Danecek et al 2022] have gained traction in recent years for producing emotionally expressive 3D mesh sequences from videos. We have seen in [Ng et al 2022; Peng et al 2023] that such vision-based models are used to create synthetic datasets from 2D videos. With EMOCA, we plan to employ a similar strategy to create our own synthetic dataset with labeled emotion categories together with continuous valence and arousal information, as depicted in Fig.…”
Section: Sub RQ3
Mentioning confidence: 99%
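The statement above describes building a synthetic, emotion-labeled dataset by running a vision-based reconstruction model such as EMOCA over 2D videos. Below is a minimal, hypothetical sketch of such a pipeline; the reconstruct_expression_params callable and the label fields are placeholders, not the actual EMOCA/DECA interfaces or the cited papers' code.

    # Hypothetical data-building sketch: the label fields and the reconstruction
    # callable are assumptions, not an actual EMOCA/DECA API.
    from dataclasses import dataclass
    from typing import Callable, Dict, List, Tuple

    @dataclass
    class LabeledSequence:
        expression_params: list   # per-frame 3D expression/jaw parameters (pseudo ground truth)
        emotion_category: str     # e.g. "happy", "angry"
        valence: float            # continuous affect label
        arousal: float            # continuous affect label

    def build_synthetic_dataset(
        videos: List[str],
        labels: Dict[str, Tuple[str, float, float]],           # video path -> (category, valence, arousal)
        reconstruct_expression_params: Callable[[str], list],  # wraps the vision-based reconstructor
    ) -> List[LabeledSequence]:
        dataset = []
        for path in videos:
            params = reconstruct_expression_params(path)  # 3D motion recovered from the 2D video
            category, valence, arousal = labels[path]
            dataset.append(LabeledSequence(params, category, valence, arousal))
        return dataset

The design point is simply that the video-derived 3D parameters serve as pseudo ground truth, while the emotion category and valence/arousal values come from the video-level annotations.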