2023
DOI: 10.1109/tmm.2022.3142387
Audio-Driven Talking Face Video Generation With Dynamic Convolution Kernels

Abstract: In this paper, we present a dynamic convolution kernel (DCK) strategy for convolutional neural networks. Using a fully convolutional network with the proposed DCKs, high-quality talking-face video can be generated from multi-modal sources (i.e., unmatched audio and video) in real time, and our trained model is robust to different identities, head postures, and input audio. Our proposed DCKs are specially designed for audio-driven talking face video generation, leading to a simple yet effective end-to-end syste…
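The core idea of the DCK strategy is that a per-frame audio feature is mapped to the weights of a convolution layer, so the kernel itself varies with the audio. The following is a minimal NumPy sketch of that mechanism; the sizes, the random linear map `W`, and the naive `conv2d` helper are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): one audio embedding per video frame.
audio_feat = rng.standard_normal(64)          # per-frame audio feature vector
out_ch, in_ch, k = 2, 3, 3                    # target kernel shape for one layer

# In the paper a learned network produces the kernel weights from audio;
# here a fixed random linear map stands in for that mapping.
W = rng.standard_normal((out_ch * in_ch * k * k, 64))
kernel = (W @ audio_feat).reshape(out_ch, in_ch, k, k)  # the "dynamic" kernel

def conv2d(x, kern):
    """Valid cross-correlation of x (in_ch, H, W) with kern (out_ch, in_ch, kh, kw)."""
    oc, ic, kh, kw = kern.shape
    _, H, Wd = x.shape
    out = np.zeros((oc, H - kh + 1, Wd - kw + 1))
    for o in range(oc):
        for i in range(H - kh + 1):
            for j in range(Wd - kw + 1):
                out[o, i, j] = np.sum(x[:, i:i + kh, j:j + kw] * kern[o])
    return out

frame = rng.standard_normal((in_ch, 8, 8))    # dummy image feature map
y = conv2d(frame, kernel)                     # audio now modulates the convolution
print(y.shape)  # (2, 6, 6)
```

Because the audio enters as kernel weights rather than as a concatenated feature map, the image branch stays a plain fully convolutional network, which is what the citing survey credits for the method's real-time performance.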

Cited by 39 publications (14 citation statements)
References 38 publications
“…As shown in Fig. 8(l), unlike the previous concatenation-based feature fusion strategy, Ye et al. [74] presented a fully convolutional neural network with dynamic convolution kernels (DCKs) for cross-modal feature fusion, which extracts features from audio and reshapes them as the DCKs of the fully convolutional network. Due to the simple yet effective network architecture, the real-time performance of VSG is significantly improved.…”
Section: Other Methods
confidence: 99%
“…Other methods rely on audio inputs to control the lower part of the face (lips, jaw movement) [5, 16, 28] to high-quality facial images while maintaining control over expression, illumination, and pose. However, StyleRig fails to exploit the 3DMM's full expression space, resulting in incorrect expression mappings in the final result (e.g.…
Section: Related Work
confidence: 99%
“…Therefore, using different static neural textures can represent different expressions, and dynamic neural textures can be regarded as an approximation of a set of static neural textures. Inspired by dynamic convolution kernels [34], we can understand dynamic neural textures in the following way. Denote by E the space of all expressions.…”
Section: Dynamic Neural Textures
confidence: 99%
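The analogy in that last statement can be written compactly. With $E$ the expression space, as in the quoted passage, a hedged sketch of the relationship follows; the symbols $T$, $T_i$, and the nearest-neighbour selection $i(e)$ are illustrative and not taken from the cited work:

```latex
% A dynamic neural texture is a mapping from expressions to texture space,
%   T : E \to \mathbb{R}^{H \times W \times C}, \qquad e \mapsto T(e).
% Sampling static textures T_i = T(e_i) at a finite set of expressions
% e_1, \dots, e_n, the dynamic texture is approximated by the closest sample:
%   T(e) \approx T_{i(e)}, \qquad i(e) = \arg\min_{i} \, d(e, e_i),
% for some distance d on E — exactly as a dynamic convolution kernel can be
% viewed as interpolating among a family of static kernels indexed by audio.
```

This mirrors the DCK construction: in both cases a conditioning signal (expression or audio) indexes into a continuously parameterised family of otherwise static network components.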