2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA)
DOI: 10.1109/apsipa.2016.7820732
Audio-visual speech enhancement using deep neural networks

Cited by 28 publications (20 citation statements); references 31 publications.
“…Estimators of speech intelligibility:
• SII [110] (1997): used for additive stationary noise or bandwidth reduction [108].
• CSII [130] (2004): extension of SII for broadband peak-clipping and center-clipping distortion [108].
• ESII [210] (2005): extension of SII for fluctuating noise [108].
• STOI [241] (2011): able to predict speech intelligibility quite accurately in several situations [7], [37], [55], [77], [85], [99], [108], [109], [122], [128], [136].
• HASPI [132] (2014): specifically designed for hearing-impaired listeners [99], [100].
• ESTOI [124] (2016): extension of STOI for highly […] [107], [108], [176], [178], [179], [244].
[…] optimally performed, because floor or ceiling effects might occur if the listeners' task is too hard or too easy. This issue can be mitigated by testing the system at several SNRs within a pre-determined range, at the expense of the time needed to conduct the listening experiments.…”
Section: Invariant
confidence: 99%
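The excerpt above recommends evaluating at several SNRs to avoid floor or ceiling effects. A minimal sketch of the underlying step, mixing clean speech with noise at a prescribed SNR, might look as follows (illustrative code, not from the paper; signal choices are arbitrary):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + noise has the requested SNR in dB."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that brings the noise power to clean_power / 10^(snr_db/10).
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s tone at 16 kHz
noise = rng.standard_normal(16000)

for snr_db in (-5.0, 0.0, 5.0):  # sweep a pre-determined SNR range
    mixture = mix_at_snr(clean, noise, snr_db)
    achieved = 10 * np.log10(np.mean(clean ** 2) / np.mean((mixture - clean) ** 2))
    # `achieved` equals `snr_db` up to numerical error
```

Sweeping a range of SNRs in this way keeps at least some conditions away from the floor and ceiling of the listeners' performance.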
“…Similarly to our approach, their visual features are not learned on the audio-visual dataset but are provided by a system trained on a different dataset. Contrary to our approach, [19] uses position-based features, while we use motion features (of the whole face), which in our experiments turned out to be much more effective than positional features.…”
Section: Related Work
confidence: 94%
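The contrast drawn above between position-based and motion features can be sketched as follows (a toy illustration, not the cited authors' implementation; array shapes and function names are assumptions). One plausible advantage of motion features is that frame-to-frame displacements are invariant to where the face sits in the image:

```python
import numpy as np

def positional_features(landmarks):
    """Flatten per-frame landmark coordinates: (T, L, 2) -> (T, 2L)."""
    return landmarks.reshape(landmarks.shape[0], -1)

def motion_features(landmarks):
    """Frame-to-frame landmark displacement: (T, L, 2) -> (T, 2L).

    Displacements are unchanged if the whole face is translated,
    unlike raw positions.
    """
    disp = np.diff(landmarks, axis=0)
    # Pad the first frame with zero motion so T is preserved.
    disp = np.concatenate([np.zeros_like(disp[:1]), disp], axis=0)
    return disp.reshape(landmarks.shape[0], -1)

# Toy trajectory: 5 frames, 3 landmarks drifting to the right.
lm = np.cumsum(np.ones((5, 3, 2)) * 0.1, axis=0)
```

Shifting every landmark by a constant leaves `motion_features` unchanged but alters `positional_features`, which matches the intuition behind the excerpt's finding.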
“…A similar model is used in [18], where the model jointly generates clean speech and the input video in a denoising-autoencoder architecture. [19] shows that using information about lip positions can help to improve speech enhancement. The video feature vector is obtained by computing pair-wise distances between mouth landmarks.…”
Section: Related Work
confidence: 99%
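The pair-wise landmark-distance feature described above can be sketched as follows (a minimal illustration, not the implementation from [19]; shapes and the function name are assumptions):

```python
import numpy as np

def mouth_distance_features(mouth):
    """Per-frame pairwise Euclidean distances between mouth landmarks.

    mouth: (T, L, 2) array of L landmark (x, y) positions over T frames.
    Returns a (T, L*(L-1)//2) feature matrix (upper-triangle distances),
    invariant to where the mouth sits in the image.
    """
    diff = mouth[:, :, None, :] - mouth[:, None, :, :]  # (T, L, L, 2)
    dist = np.linalg.norm(diff, axis=-1)                # (T, L, L)
    iu, ju = np.triu_indices(mouth.shape[1], k=1)       # unique pairs only
    return dist[:, iu, ju]

rng = np.random.default_rng(1)
mouth = rng.random((3, 4, 2))           # 3 frames, 4 mouth landmarks
feats = mouth_distance_features(mouth)  # shape (3, 6)
```

Because only inter-landmark distances are kept, the feature is unaffected by translating the mouth region within the frame.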
“…Audio-visual speaker separation and speech enhancement methods proposed to date (e.g. [39], [40]) typically estimate a clean audio signal, rather than a mask, from audio features taken from the mixture and from visual features extracted from the speaker. In this work, we instead use the audio estimates taken from the visual features of each speaker to create either a binary mask or a ratio mask.…”
Section: Introduction
confidence: 99%