Interspeech 2016
DOI: 10.21437/interspeech.2016-483

Audio-to-Visual Speech Conversion Using Deep Neural Networks

Abstract: We study the problem of mapping from acoustic to visual speech with the goal of generating accurate, perceptually natural speech animation automatically from an audio speech signal. We present a sliding window deep neural network that learns a mapping from a window of acoustic features to a window of visual features from a large audio-visual speech dataset. Overlapping visual predictions are averaged to generate continuous, smoothly varying speech animation. We outperform a baseline HMM inversion approach in b…
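To make the sliding-window scheme above concrete, here is a minimal NumPy sketch of inference with overlap averaging. The window length, hop size, feature dimensions, and the trained `model` callable are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def predict_visual(model, audio_feats, win=11, hop=1):
    """Sliding-window audio-to-visual prediction with overlap averaging.

    model       -- callable mapping a flattened (win * A,) acoustic window
                   to a flattened (win * V,) visual window (assumed trained).
    audio_feats -- (T, A) array of per-frame acoustic features, T >= win.
    Returns a (T, V) array of smoothed per-frame visual features.
    """
    T = audio_feats.shape[0]
    # Probe the model once to recover the visual dimension V (assumption:
    # the model predicts win visual frames per acoustic window).
    V = np.asarray(model(audio_feats[:win].ravel())).size // win
    total = np.zeros((T, V))    # sum of overlapping window predictions
    count = np.zeros((T, 1))    # number of windows covering each frame
    for s in range(0, T - win + 1, hop):
        pred = np.asarray(model(audio_feats[s:s + win].ravel()))
        total[s:s + win] += pred.reshape(win, V)
        count[s:s + win] += 1
    # Averaging the overlapping predictions yields continuous,
    # smoothly varying visual trajectories, as the abstract describes.
    return total / np.maximum(count, 1)
```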


Year published: 2017–2023

Cited by 33 publications (31 citation statements)
References 26 publications
“…DNN-based models directly learn to predict the movements from speech features. Taylor et al. [3] proposed a fully connected feedforward neural network for audio-to-visual conversion. Their network takes speech features over a specified contextual window, predicting current and future orofacial movements.…”
Section: DNN-based Modeling (citation type: mentioning)
confidence: 99%
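As a rough illustration of the model family this statement describes, the following PyTorch sketch maps a flattened contextual window of acoustic features to a window of orofacial parameters with a fully connected feedforward network. The feature dimensions and layer widths are assumptions for illustration, not the architecture of Taylor et al. [3].

```python
import torch
import torch.nn as nn

# Assumed dimensions (illustrative only): 13 acoustic features per frame,
# an 11-frame contextual window, 30 visual (orofacial) parameters per frame.
N_AUDIO, WIN, N_VISUAL = 13, 11, 30

# Fully connected feedforward regressor from an acoustic context window
# to the visual parameters for every frame in that window.
net = nn.Sequential(
    nn.Linear(N_AUDIO * WIN, 2000), nn.ReLU(),
    nn.Linear(2000, 2000), nn.ReLU(),
    nn.Linear(2000, N_VISUAL * WIN),
)

x = torch.randn(8, N_AUDIO * WIN)    # batch of flattened audio windows
y = net(x).view(8, WIN, N_VISUAL)    # per-frame visual predictions
```

Because the output window extends beyond the current frame, such a network predicts current and future movements jointly, which is what makes the overlap-averaging step at synthesis time possible.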
“…The integration between these factors in the orofacial area is complex [1,2]. Most previous studies on lip movement synthesis have relied on recordings from a single subject in order to avoid speaker variations [3,4,5]. Since multimodal emotional corpora usually include multiple speakers with limited data per subject [6], it is important that the models can effectively capture speaker variability.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
“…via phonemic transcription. For direct approaches, the conversion function typically involves some form of regression [16,24,26,27] or indexing a codebook of visual features using the corresponding features extracted from the acoustic speech [3,13]. For indirect approaches, the mapping function involves concatenation or interpolation of pre-existing data [5,7,9,21,29] or using a generative model [2,10,17].…”
Section: Related Work (citation type: mentioning)
confidence: 99%
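To make the "indexing a codebook of visual features" route above concrete, here is a minimal nearest-neighbour sketch. How the paired codebooks are built (e.g. joint k-means over audio-visual training data) and the feature dimensions are assumptions for illustration.

```python
import numpy as np

def codebook_lookup(audio_codebook, visual_codebook, audio_frame):
    """Direct audio-to-visual mapping by codebook indexing.

    audio_codebook  -- (K, A) acoustic prototypes (assumed precomputed,
                       e.g. by k-means on training audio).
    visual_codebook -- (K, V) visual features paired with each prototype.
    audio_frame     -- (A,) acoustic feature vector to convert.
    Returns the visual entry paired with the nearest acoustic prototype.
    """
    dists = np.linalg.norm(audio_codebook - audio_frame, axis=1)
    return visual_codebook[np.argmin(dists)]

# Usage with toy data: 64 codewords, 13-D audio, 30-D visual features.
rng = np.random.default_rng(0)
A_cb, V_cb = rng.normal(size=(64, 13)), rng.normal(size=(64, 30))
visual = codebook_lookup(A_cb, V_cb, rng.normal(size=13))
```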
“…The human face has a very complex muscular and bone structure. Facial expressions are normally produced by stretching and contracting, or relaxing and tensing, the facial muscles [1]. For the current project, video of a face bearing hundreds of facial markers was captured with the help of an optical motion capture system.…”
Section: Introduction (citation type: mentioning)
confidence: 99%