2020
DOI: 10.1109/jstsp.2020.2987209
Audio-Visual Speech Separation and Dereverberation With a Two-Stage Multimodal Network

Abstract: Background noise, interfering speech and room reverberation frequently distort target speech in real listening environments. In this study, we address joint speech separation and dereverberation, which aims to separate target speech from background noise, interfering speech and room reverberation. In order to tackle this fundamentally difficult problem, we propose a novel multimodal network that exploits both audio and visual signals. The proposed network architecture adopts a two-stage strategy, where a separ…

citations
Cited by 55 publications
(77 citation statements)
references
References 55 publications
“…Estimators of speech intelligibility:
SII [110]; 1997; Used for additive stationary noise or bandwidth reduction; [108]
CSII [130]; 2004; Extension of SII for broadband peak-clipping and center-clipping distortion; [108]
ESII [210]; 2005; Extension of SII for fluctuating noise; [108]
STOI [241]; 2011; Able to predict quite accurately speech intelligibility in several situations; [7], [37], [55], [77], [85], [108], [109], [99], [122], [128], [136]
HASPI [132]; 2014; Specifically designed for hearing-impaired listeners; [99], [100]
ESTOI [124]; 2016; Extension of STOI for highly…; [107], [108], [176], [178], [179], [244]
…optimally performed, because floor or ceiling effects might occur if the listeners' task is too hard or too easy. This issue can be mitigated by testing the system at several SNR within a pre-determined range, at the expense of the time needed to conduct the listening experiments.…”
Section: Invariant
Citation type: mentioning
confidence: 99%
“…This means that in order to have good performance in a wide variety of settings, very large AV datasets for training and testing need to be collected. In practice, the systems are trained using a large number of complex acoustic
[66], [76], [77], [85], [99], [122], [128], [164], [165], [176], [178], [179], [220]-[222], [244], [263], [274], [
[17], [65], [154], [164], [165]
Landmark-based features: [100], [154], [183], [203]
Multisensory features: [195]
Face recognition embedding: [55], [109], [169], [192], [239]
VSR embedding: [7], [10], [107]-[109], [153], [222], [273]
Facial appearance embedding: [42], [208]
Compressed mouth frames: [37]
Speaker direction: [85], [244], [279]
Acoustic Features…”
Section: Audio-visual Speech Enhancement and Separation Systems
Citation type: mentioning
confidence: 99%
“…With the renaissance of neural networks, better objective performance can be achieved using deep learning methods [5,6,7]. However, it often results in greater amount of nonlinear distortion on the separated target speech [8,9,10], which harms the performance of ASR systems.…”
Section: Introduction
Citation type: mentioning
confidence: 99%