Audio-to-Visual Conversion Via HMM Inversion for Speech-Driven Facial Animation

Terissi, Lucas D.; Gómez, Juan Carlos

doi:10.1007/978-3-540-88190-2_9

Cited by 8 publications

(5 citation statements)

References 11 publications

“…Given a speaker's audio information, the generation of the corresponding person speaking video has attracted many researchers' interests. Earlier works mainly used the Hidden Markov model (HMM) to generate corresponding relationships between speech and facial motions [9][10][11][12][13][14]. Among them, Brand [15] proposed voice puppetry as an HMM-based method for generating conversation faces driven only by voice signals.…”

Section: Related Work (mentioning)

confidence: 99%

Realistic Speech-Driven Talking Video Generation with Personalized Pose

Zhang

Weng

2020

Complexity

In this work, we propose a method to transform a speaker’s speech information into a target character’s talking video; the method could make the mouth shape synchronization, expression, and body posture more realistic in the synthesized speaker video. This is a challenging task because changes of mouth shape and posture are coupled with audio semantic information. The model training is difficult to converge, and the model effect is unstable in complex scenes. Existing speech-driven speaker methods cannot solve this problem well. The method proposed in this paper first generates the sequence of key points of the speaker’s face and body postures from the audio signal in real time and then visualizes these key points as a series of two-dimensional skeleton images. Subsequently, we generate the final real speaker video through the video generation network. We take a random sampling of audio clips, encode audio contents and temporal correlations using a more effective network structure, and optimize and iterate network outputs using differential loss and attitude perception loss, so as to obtain a smoother pose key-point sequence and better performance. In addition, by inserting a specified action frame into the synthesized human pose sequence window, action poses of the synthesized speaker are enriched, making the synthesis effect more realistic and natural. Then, the final speaker video is generated by the obtained gesture key points through the video generation network. In order to generate realistic and high-resolution pose detail videos, we insert a local attention mechanism into the key point network of the generated pose sequence and give higher attention to the local details of the characters through spatial weight masks. In order to verify the effectiveness of the proposed method, we used the objective evaluation index NME and user subjective evaluation methods, respectively. 
Experimental results showed that our method could vividly use audio contents to generate corresponding speaker videos, and its lip-matching accuracy and expression postures are better than those of previous work. Compared with existing methods in the NME index and user subjective evaluation, our method showed better results.

“…There exist a few approaches to speech-driven talking face generation. Early work in this field mostly used Hidden Markov Models (HMM) to model the correspondence between speech and facial movements [2,4,8,7,24,20,25]. One of the notable early works, Voice Puppetry [2], proposed an HMM-based talking face generation that is driven by only speech signal.…”

Section: Introduction (mentioning)

confidence: 99%

“…Choi et al. [4] and Terissi et al. [20] used HMM inversion (HMMI) to estimate the visual parameters from speech. Zhang et al. [25] used a DNN to map speech features into HMM states, which further maps to generated faces.…”

Section: Introduction (mentioning)

confidence: 99%

Generating Talking Face Landmarks from Speech

Eskimez

Maddox

et al. 2018

Lecture Notes in Computer Science

The presence of a corresponding talking face has been shown to significantly improve speech intelligibility in noisy conditions and for hearing impaired population. In this paper, we present a system that can generate landmark points of a talking face from an acoustic speech in real time. The system uses a long short-term memory (LSTM) network and is trained on frontal videos of 27 different speakers with automatically extracted face landmarks. After training, it can produce talking face landmarks from the acoustic speech of unseen speakers and utterances. The training phase contains three key steps. We first transform landmarks of the first video frame to pin the two eye points into two predefined locations and apply the same transformation on all of the following video frames. We then remove the identity information by transforming the landmarks into a mean face shape across the entire training dataset. Finally, we train an LSTM network that takes the first- and second-order temporal differences of the log-mel spectrogram as input to predict face landmarks in each frame. We evaluate our system using the mean-squared error (MSE) loss of landmarks of lips between predicted and ground-truth landmarks as well as their first- and second-order temporal differences. We further evaluate our system by conducting subjective tests, where the subjects try to distinguish the real and fake videos of talking face landmarks. Both tests show promising results.
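
The first- and second-order temporal differences mentioned in this abstract can be illustrated with a short sketch. It is a minimal example in plain Python, not the cited authors' code: the function name `temporal_differences` and the list-of-lists frame layout are assumptions, and simple frame-to-frame differences are used here, whereas the cited system may compute its deltas differently.

```python
def temporal_differences(frames):
    """Return (first, second) order frame-to-frame differences.

    `frames` is a list of feature vectors (lists of floats), one per
    time step, for example rows of a log-mel spectrogram.
    This layout is an assumption made for illustration.
    """
    # First-order difference: each frame minus the previous frame.
    first = [
        [cur_v - prev_v for prev_v, cur_v in zip(prev, cur)]
        for prev, cur in zip(frames, frames[1:])
    ]
    # Second-order difference: the difference of the first-order differences.
    second = [
        [cur_v - prev_v for prev_v, cur_v in zip(prev, cur)]
        for prev, cur in zip(first, first[1:])
    ]
    return first, second


# Toy input: four frames with two features each.
toy_frames = [[1.0, 0.0], [2.0, 1.0], [4.0, 1.0], [7.0, 0.0]]
toy_first, toy_second = temporal_differences(toy_frames)
```

A sequence of T frames yields T-1 first-order and T-2 second-order vectors, so a real pipeline would need to pad or trim to keep the inputs aligned with the video frames.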

“…Many approaches rely on non-linear statistical models which are trained on corpora of audio-visual speech and learn a mapping from some acoustic parameterization to a corresponding visual parameterization. A popular approach is to use hidden Markov models (HMMs) [13][14][15][16][17][18], which have been widely used by the speech community for decades for both speech recognition and synthesis. Chen [14] trained HMMs on joint audio-visual features then separated the models for prediction.…”

Section: Introduction (mentioning)

confidence: 99%

“…For new speech, the visual HMM was sampled using the acoustic state sequence as derived from the Viterbi algorithm. Choi et al [15] and Terissi and Gómez [16] also trained joint audio-visual HMMs but used HMM inversion (HMMI) to infer the visual parameters. Xie et al [17] introduced coupled HMMs (CHMMs) to account for the asynchrony between audio and visual activity caused by coarticulation [19].…”

Section: Introduction (mentioning)

confidence: 99%

Audio-to-Visual Speech Conversion Using Deep Neural Networks

Taylor

Kato

Matthews

et al. 2016

Interspeech 2016

We study the problem of mapping from acoustic to visual speech with the goal of generating accurate, perceptually natural speech animation automatically from an audio speech signal. We present a sliding window deep neural network that learns a mapping from a window of acoustic features to a window of visual features from a large audio-visual speech dataset. Overlapping visual predictions are averaged to generate continuous, smoothly varying speech animation. We outperform a baseline HMM inversion approach in both objective and subjective evaluations and perform a thorough analysis of our results. Index Terms: Audio-to-visual conversion, automatic speech animation, sliding window deep neural networks.
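
The step in which overlapping visual predictions are averaged can be sketched as follows. This is a minimal illustration in plain Python, not the cited authors' implementation: the function name `overlap_average`, the `hop` parameter, and the list-of-lists frame layout are assumptions made for the example.

```python
def overlap_average(windows, hop=1):
    """Average overlapping per-window predictions into one frame sequence.

    `windows[i]` holds the predicted frames for window i, which is assumed
    to start at frame i * hop. All windows are assumed to have the same
    width, and each frame is a list of floats.
    """
    if not windows:
        return []
    width = len(windows[0])
    dim = len(windows[0][0])
    if hop < 1 or hop > width:
        # A hop larger than the window would leave frames with no prediction.
        raise ValueError("hop must be between 1 and the window width")

    n_frames = (len(windows) - 1) * hop + width
    sums = [[0.0] * dim for _ in range(n_frames)]
    counts = [0] * n_frames

    # Accumulate every prediction at the absolute frame index it refers to.
    for i, window in enumerate(windows):
        for j, frame in enumerate(window):
            t = i * hop + j
            counts[t] += 1
            for d, value in enumerate(frame):
                sums[t][d] += value

    # Divide by the number of windows that covered each frame.
    return [[s / counts[t] for s in sums[t]] for t in range(n_frames)]


# Toy input: three windows of width 3, one visual feature per frame.
toy_windows = [
    [[0.0], [2.0], [4.0]],
    [[4.0], [6.0], [8.0]],
    [[6.0], [8.0], [10.0]],
]
smoothed = overlap_average(toy_windows, hop=1)
```

Frames near the start and end of the sequence are covered by fewer windows, so they receive less smoothing than frames in the middle.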
