2019
DOI: 10.1007/978-3-030-37734-2_60
An Attention Based Speaker-Independent Audio-Visual Deep Learning Model for Speech Enhancement

Cited by 4 publications (13 citation statements) · References 13 publications
“…Nevertheless, the datasets collected in controlled environments are a good choice for training a prototype designed for a specific purpose or for studying a particular problem. Examples of databases useful in this sense are: TCD-TIMIT [89] and OuluVS2 [16], to study the influence of several angles of view; MODALITY [46] and OuluVS2 [16], to determine the effect of different video frame rates; Lombard GRID [13], to understand the impact of the Lombard effect, also from several angles of view; RAVDESS [161], to perform a study of emotions in the context of SE and SS; KinectDigits [224] and MODALITY [46], to determine …”

[The remainder of this excerpt is flattened table residue listing audio-visual corpora, among them OuluVS [286], VoxCeleb2 [38] (6,112 speakers, continuous sentences, videos in the wild) and LRS2 [8] (hundreds of speakers, continuous sentences, videos in the wild).]
Section: Audio-Visual Corpora
confidence: 99%
“…This means that in order to have good performance in a wide variety of settings, very large AV datasets for training and testing need to be collected. In practice, the systems are trained using a large number of complex acoustic …”

[The remainder of this excerpt is flattened table residue listing visual feature types and their references: landmark-based features [100], [154], [183], [203]; multisensory features [195]; face recognition embeddings [55], [109], [169], [192], [239]; VSR embeddings [7], [10], [107]–[109], [153], [222], [273]; facial appearance embeddings [42], [208]; compressed mouth frames [37]; speaker direction [85], [244], [279].]
Section: Audio-Visual Speech Enhancement and Separation Systems
confidence: 99%
“…[7], [10], [153] …”

[This excerpt is flattened table residue listing acoustic representations and their references: complex spectrogram [55], [107], [109], [169], [239]; raw waveform [108], [273]; speaker embeddings [10], [85], [169], [192], [208].]
Section: Audio-Visual Speech Enhancement and Separation Systems
confidence: 99%
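
The last excerpt contrasts the two acoustic representations most often used by the cited enhancement systems: the complex spectrogram and the raw waveform. As an illustration only (not the code of any surveyed system), a minimal NumPy sketch of the short-time Fourier transform that maps a raw waveform to a complex spectrogram; the frame length, hop size, and sampling rate below are arbitrary example values, not parameters from the paper:

```python
import numpy as np

def stft(waveform, frame_len=512, hop=128):
    """Short-time Fourier transform: raw waveform -> complex spectrogram."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(waveform) - frame_len) // hop
    frames = np.stack([waveform[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-redundant half of the spectrum for real input
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, frame_len // 2 + 1)

# Example: one second of a 440 Hz sine at a 16 kHz sampling rate
sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440 * t)
spec = stft(wave)
print(spec.shape)  # (122, 257)
```

Enhancement models that operate on the complex spectrogram predict a cleaned version of (or a mask over) `spec`, whereas waveform-domain models work on `wave` directly.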