2021
DOI: 10.1109/tetci.2019.2917039

Lip-Reading Driven Deep Learning Approach for Speech Enhancement

Abstract: This paper proposes a novel lip-reading driven deep learning framework for speech enhancement. The proposed approach leverages the complementary strengths of both deep learning and analytical acoustic modelling (filtering based approach) as compared to recently published, comparatively simpler benchmark approaches that rely only on deep learning. The proposed audio-visual (AV) speech enhancement framework operates at two levels. In the first level, a novel deep learning based lipreading regression model is emp…
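The abstract describes a two-level pipeline: a deep lip-reading regression model first estimates clean-speech features from visual input, and an analytical, filtering-based stage then uses that estimate for enhancement. A minimal sketch of the first level is given below, assuming an LSTM that maps per-frame lip features to clean-speech filterbank estimates; the feature dimensions, layer sizes and frame counts are illustrative assumptions rather than the paper's exact configuration.

# Minimal sketch of the first level described in the abstract: a
# lip-reading regression model that maps visual (lip) features to an
# estimate of clean-speech filterbank features. All dimensions and
# layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class LipReadingRegressor(nn.Module):
    def __init__(self, visual_dim=50, hidden_dim=128, audio_dim=23):
        super().__init__()
        self.lstm = nn.LSTM(visual_dim, hidden_dim, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden_dim, audio_dim)

    def forward(self, lip_feats):              # (batch, frames, visual_dim)
        h, _ = self.lstm(lip_feats)            # (batch, frames, hidden_dim)
        return self.proj(h)                    # (batch, frames, audio_dim)

model = LipReadingRegressor()
lip_feats = torch.randn(1, 75, 50)             # e.g. 75 video frames of lip features
clean_envelope_estimate = model(lip_feats)     # feeds the second, filtering-based level

The regressed envelope is what the second, analytical level consumes; a Wiener-style gain built from such an estimate is sketched further below.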

Cited by 52 publications (57 citation statements)
References 43 publications
“…Nevertheless, the datasets collected in controlled environments are a good choice for training a prototype designed for a specific purpose or for studying a particular problem. Examples of databases useful in this sense are: TCD-TIMIT [89] and OuluVS2 [16], to study the influence of several angles of view; MODALITY [46] and OuluVS2 [16], to determine the effect of different video frame rates; Lombard GRID [13], to understand the impact of the Lombard effect, also from several angles of view; RAVDESS [161], to perform a study of emotions in the context of SE and SS; KinectDigits [224] and MODALITY [46], to determine … [flattened corpus table omitted; recoverable entries include a corpus with 1,000 three-second utterances per speaker (25 FPS, frontal face), OuluVS [286], VoxCeleb2 [38] (6,112 speakers, continuous sentences, videos in the wild) and LRS2 [8] (continuous sentences, videos in the wild)]…”
Section: Audio-visual Corpora (mentioning)
confidence: 99%
“…Magnitude spectrogram: [3]-[7], [10], [12], [17], [37], [42], [65], [66], [76], [77], [85], [99], [100], [107], [122], [128], [136], [153], [154], [164], [165], [176], [178], [179], [183], [192], [195], [203], [208], [220]-[222], [244], [263], [274], [279]. Phase: [7], [10], [153]. Complex spectrogram: [55], [107], [109], [169], [239]. Raw waveform: [108], [273]. Speaker embeddings: [10], [85], [169], [192], …”
Section: Audio-visual Speech Enhancement and Separation Systems (mentioning)
confidence: 99%
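The statement above groups citing systems by their audio front end. The snippet below shows, under assumed sampling-rate and STFT parameters, how the listed representations relate to one another: the raw waveform, its complex STFT, and the magnitude/phase decomposition of that STFT.

# Illustrative computation of the audio representations listed in the
# statement above: raw waveform, complex spectrogram, and its
# magnitude/phase decomposition. Sampling rate and STFT parameters are
# assumptions chosen for the example.
import numpy as np
from scipy.signal import stft

fs = 16000
waveform = np.random.randn(fs)                               # 1 s of audio: the raw-waveform input

_, _, Z = stft(waveform, fs=fs, nperseg=512, noverlap=384)   # Z: complex spectrogram (freq x frames)

magnitude = np.abs(Z)                                        # magnitude spectrogram
phase = np.angle(Z)                                          # phase
complex_as_channels = np.stack([Z.real, Z.imag], axis=0)     # real/imaginary channel view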
“…In the literature, extensive research has been carried out to develop audio-only (A-only) and audio-visual (AV) SE methods. Researchers have proposed several SE models such as deep neural network (DNN) based spectral mask estimation models [12,13], DNN based clean spectrogram estimation models [14,15], Wiener filtering based hybrid models [16,17,18], and time-domain SE models [19,20,21]. However, limited work has been conducted to develop robust language-independent, causal, speaker- and noise-independent AV SE models for the low SNRs (< -3 dB) observed in everyday social environments (such as cafeterias and restaurants), where traditional A-only hearing aids fail to improve speech intelligibility.…”
Section: Introduction (mentioning)
confidence: 99%
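Of the model families listed in the statement above, spectral mask estimation is the most common: a DNN predicts a time-frequency mask that is applied to the noisy spectrogram. The sketch below uses an ideal ratio mask as a stand-in for the network output; the shapes and the simple additive mixing are assumptions made for illustration.

# Sketch of mask-based speech enhancement: a time-frequency mask is
# applied to the noisy magnitude spectrogram. The ideal ratio mask
# computed here stands in for what a mask-estimation DNN would predict;
# all shapes and values are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
clean_mag = np.abs(rng.normal(size=(257, 100)))      # |S|: clean magnitude spectrogram
noise_mag = np.abs(rng.normal(size=(257, 100)))      # |N|: noise magnitude spectrogram
noisy_mag = clean_mag + noise_mag                    # simplistic additive mixture

irm = clean_mag / (clean_mag + noise_mag + 1e-8)     # ideal ratio mask (typical training target)
enhanced_mag = irm * noisy_mag                       # enhancement: mask applied to the mixture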
“…However, the model fails to work when the visuals are occluded. Adeel et al. [17,18] proposed visual-only and AV SE models by integrating an enhanced visually-derived Wiener filter (EVWF) with a DNN based lip-reading regression model. The preliminary evaluation demonstrated their effectiveness in dealing with spectro-temporal variations in a wide variety of noisy environments.…”
mentioning
confidence: 99%
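The EVWF mentioned in the statement above drives a Wiener-style filter from a visually derived estimate of the clean speech. A minimal sketch of that gain computation is given below; the clean-power estimate is a placeholder for a lip-reading regression output, and all quantities are illustrative assumptions rather than the published filter design.

# Sketch of a Wiener-style gain driven by a visually derived clean-speech
# power estimate, in the spirit of the EVWF discussed above. The
# clean-power estimate is a placeholder for a lip-reading regression
# output; values and shapes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)
noisy_power = rng.normal(size=(257, 100)) ** 2       # |Y|^2: noisy power spectrogram
clean_power_est = 0.6 * noisy_power                  # placeholder visually derived estimate
noise_power_est = np.maximum(noisy_power - clean_power_est, 1e-8)

wiener_gain = clean_power_est / (clean_power_est + noise_power_est)
enhanced_power = wiener_gain * noisy_power           # filtered (enhanced) power spectrogram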
“…The AVIndex provided remarkably good predictions of overall audiovisual-speech intelligibility given its simplicity, suggesting that tweaks designed to address the model's faults as noted in Experiments 1 and 2 (e.g., over-prediction of intelligibility at very low SNRs) could yield a very accurate and informative model. Recent machine learning models designed to perform audiovisual speech enhancement suggest that it is possible to derive local, frame-by-frame estimates of acoustic speech features directly from short segments of concurrent visual speech (Adeel, Gogate, Hussain, & Whitmer, 2018).…”
Section: Discussion (mentioning)
confidence: 99%