2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2017.7953127

Vid2speech: Speech reconstruction from silent video

Abstract: Speechreading is a notoriously difficult task for humans to perform. In this paper we present an end-to-end model based on a convolutional neural network (CNN) for generating an intelligible acoustic speech signal from silent video frames of a speaking person. The proposed CNN generates sound features for each frame based on its neighboring frames. Waveforms are then synthesized from the learned speech features to produce intelligible speech. We show that by leveraging the automatic feature learning capabilities…
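To make the mapping the abstract describes concrete, here is a minimal PyTorch sketch of a CNN that predicts a per-frame acoustic feature vector from a window of neighboring video frames. This is not the paper's actual architecture: the window length K, the layer sizes, and the feature dimensionality are all illustrative assumptions.

```python
# Minimal sketch (assumed architecture, not the paper's): map a window of
# K grayscale frames centered on the current frame to one acoustic feature
# vector. A full pipeline would then synthesize a waveform from these
# features, as the abstract describes.
import torch
import torch.nn as nn

K = 9          # assumed number of neighboring frames per prediction
FEAT_DIM = 18  # assumed size of the per-frame acoustic feature vector

class FramesToSpeechFeatures(nn.Module):
    def __init__(self):
        super().__init__()
        # Treat the K frames as the channel dimension of one 2D input.
        self.conv = nn.Sequential(
            nn.Conv2d(K, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(128, FEAT_DIM)

    def forward(self, frames):            # frames: (batch, K, H, W)
        h = self.conv(frames).flatten(1)  # -> (batch, 128)
        return self.head(h)               # -> (batch, FEAT_DIM)

model = FramesToSpeechFeatures()
windows = torch.randn(4, K, 128, 128)     # 4 windows of K 128x128 frames
features = model(windows)                 # -> (4, FEAT_DIM)
```

Training such a model would regress `features` against acoustic features extracted from the ground-truth audio, with waveforms synthesized from the predicted features afterwards.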

Cited by 108 publications (115 citation statements: 0 supporting, 114 mentioning, 1 contrasting), with citing publications spanning 2018–2023. References 16 publications. Citation statements are listed below, ordered by relevance.
“…In this experiment, we attempted to separate the speech of two 'unknown' speakers. First, we trained a vid2speech network [5] on the data of a 'known' speaker (S2 from GRID). The training data consisted of randomly selected sentences (40 minutes length in total).…”
Section: Results (mentioning); confidence: 99%
“…Several approaches exist for generation of intelligible speech from silent video frames of a person speaking [5,6,7]. In this work we rely on vid2speech [6], briefly described in Sec.…”
Section: Visually-derived Speech Generation (mentioning); confidence: 99%
“…Convolutional neural networks (CNNs) have been shown to be powerful feature extractors for images and videos and have replaced handcrafted features in more recent works. One such system is proposed in [8] to predict line spectrum pairs (LSPs) from video. The LSPs are converted into waveforms but since excitation is not predicted the resulting speech sounds unnatural.…”
Section: Introduction (mentioning); confidence: 99%
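The excitation issue raised in this statement can be illustrated with a short sketch. LPC/LSP features specify only the all-pole vocal-tract filter 1/A(z); synthesis must drive that filter with an excitation signal, and if excitation is not predicted it has to be faked. The coefficients and excitation choices below are placeholders, not values from the cited system.

```python
# Sketch of LPC synthesis with artificial excitation. The filter A(z)
# here is a placeholder (assumed stable), not output from a real model.
import numpy as np
from scipy.signal import lfilter

sr = 16000
a = np.array([1.0, -1.3, 0.8, -0.2])   # assumed LPC polynomial A(z)

# Unvoiced-style excitation: white noise, i.e. no fundamental frequency.
unvoiced = lfilter([1.0], a, np.random.randn(sr // 2))

# Crude voiced-style excitation: a 100 Hz impulse train. Without a
# predicted F0 contour this pitch is arbitrary, which is exactly the
# source of the unnatural-sounding output noted above.
pulses = np.zeros(sr // 2)
pulses[:: sr // 100] = 1.0
voiced = lfilter([1.0], a, pulses)
```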
“…This approach had the limitation of missing certain speech components such as fundamental frequency and aperiodicity which was then determined artificially thereby compromising quality in order to maximize intelligibility. Ephrat and Peleg (2017) modified this technique by using an end-to-end CNN to extract visual features from the entire face while applying a similar approach for modeling audio features using 8th order Linear Predictive Coding (LPC) analysis followed by Line Spectrum Pairs (LSP) decomposition. However, it also suffered from the same missing excitation parameters resulting in an unnatural sounding voice.…”
Section: Related Work (mentioning); confidence: 99%
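For readers unfamiliar with the audio-side processing this statement refers to, here is a hedged sketch of 8th-order LPC analysis followed by LSP decomposition. It uses a synthetic input frame and the textbook construction of the symmetric and antisymmetric polynomials, not any specific implementation from the cited papers.

```python
# Sketch: 8th-order LPC analysis of one speech frame, then LSP
# decomposition via P(z) = A(z) + z^-(p+1) A(1/z) and
# Q(z) = A(z) - z^-(p+1) A(1/z). The input frame is synthetic.
import numpy as np
import librosa

p = 8                                   # LPC order used by the cited work
sr = 16000
t = np.arange(sr // 50) / sr            # one 20 ms frame at 16 kHz
frame = np.sin(2 * np.pi * 200 * t) + 0.1 * np.random.randn(t.size)

a = librosa.lpc(frame, order=p)         # coefficients [1, a1, ..., ap]

ext = np.append(a, 0.0)                 # pad so the flipped copy aligns
P = ext + ext[::-1]                     # symmetric polynomial
Q = ext - ext[::-1]                     # antisymmetric polynomial

# Line spectral frequencies are the angles of the unit-circle roots;
# keep one frequency per conjugate pair, excluding the fixed roots
# at angles 0 and pi.
roots = np.concatenate([np.roots(P), np.roots(Q)])
lsf = np.sort(np.angle(roots))
lsf = lsf[(lsf > 1e-6) & (lsf < np.pi - 1e-6)]   # p = 8 frequencies
```

The interleaving of the P and Q frequencies is what makes LSPs numerically well-behaved for quantization and regression, which is presumably why the systems discussed above predict LSPs rather than raw LPC coefficients.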