2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA)
DOI: 10.1109/apsipa.2016.7820732
Audio-visual speech enhancement using deep neural networks

Cited by 28 publications (20 citation statements); references 31 publications.
“…Estimators of speech intelligibility:
• SII [110] (1997): used for additive stationary noise or bandwidth reduction [108].
• CSII [130] (2004): extension of SII for broadband peak-clipping and center-clipping distortion [108].
• ESII [210] (2005): extension of SII for fluctuating noise [108].
• STOI [241] (2011): able to predict speech intelligibility quite accurately in several situations [7], [37], [55], [77], [85], [99], [108], [109], [122], [128], [136].
• HASPI [132] (2014): specifically designed for hearing-impaired listeners [99], [100].
• ESTOI [124] (2016): extension of STOI for highly […] [107], [108], [176], [178], [179], [244].
[…] optimally performed, because floor or ceiling effects might occur if the listeners' task is too hard or too easy. This issue can be mitigated by testing the system at several SNRs within a pre-determined range, at the expense of the time needed to conduct the listening experiments.…”
Section: Invariant
confidence: 99%
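The excerpt above recommends evaluating at several SNRs to avoid floor or ceiling effects. A minimal sketch of the underlying step, mixing clean speech with noise at a prescribed SNR, might look as follows (illustrative code, not from the paper; signal choices are arbitrary):

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so that clean + noise has the requested SNR in dB."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    # Gain that brings the noise power to clean_power / 10^(snr_db/10).
    gain = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + gain * noise

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s tone at 16 kHz
noise = rng.standard_normal(16000)

for snr_db in (-5.0, 0.0, 5.0):  # sweep a pre-determined SNR range
    mixture = mix_at_snr(clean, noise, snr_db)
    achieved = 10 * np.log10(np.mean(clean ** 2) / np.mean((mixture - clean) ** 2))
    # `achieved` equals `snr_db` up to numerical error
```

Sweeping a range of SNRs in this way keeps at least some conditions away from the floor and ceiling of the listeners' performance.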
“…Similarly to our approach, their visual features are not learned on the audio-visual dataset but are provided by a system trained on a different dataset. Contrary to our approach, [19] uses position-based features, while we use motion features (of the whole face), which in our experiments turned out to be much more effective than positional features.…”
Section: Related Work
confidence: 94%
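The contrast drawn above between position-based and motion features can be sketched as follows (a toy illustration, not the cited authors' implementation; array shapes and function names are assumptions). One plausible advantage of motion features is that frame-to-frame displacements are invariant to where the face sits in the image:

```python
import numpy as np

def positional_features(landmarks):
    """Flatten per-frame landmark coordinates: (T, L, 2) -> (T, 2L)."""
    return landmarks.reshape(landmarks.shape[0], -1)

def motion_features(landmarks):
    """Frame-to-frame landmark displacement: (T, L, 2) -> (T, 2L).

    Displacements are unchanged if the whole face is translated,
    unlike raw positions.
    """
    disp = np.diff(landmarks, axis=0)
    # Pad the first frame with zero motion so T is preserved.
    disp = np.concatenate([np.zeros_like(disp[:1]), disp], axis=0)
    return disp.reshape(landmarks.shape[0], -1)

# Toy trajectory: 5 frames, 3 landmarks drifting to the right.
lm = np.cumsum(np.ones((5, 3, 2)) * 0.1, axis=0)
```

Shifting every landmark by a constant leaves `motion_features` unchanged but alters `positional_features`, which matches the intuition behind the excerpt's finding.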
“…A similar model is used in [18], where the model jointly generates clean speech and the input video in a denoising-autoencoder architecture. [19] shows that using information about lip positions can help to improve speech enhancement. The video feature vector is obtained by computing pair-wise distances between mouth landmarks.…”
Section: Related Work
confidence: 99%
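The pair-wise landmark-distance feature described above can be sketched as follows (a minimal illustration, not the implementation from [19]; shapes and the function name are assumptions):

```python
import numpy as np

def mouth_distance_features(mouth):
    """Per-frame pairwise Euclidean distances between mouth landmarks.

    mouth: (T, L, 2) array of L landmark (x, y) positions over T frames.
    Returns a (T, L*(L-1)//2) feature matrix (upper-triangle distances),
    invariant to where the mouth sits in the image.
    """
    diff = mouth[:, :, None, :] - mouth[:, None, :, :]  # (T, L, L, 2)
    dist = np.linalg.norm(diff, axis=-1)                # (T, L, L)
    iu, ju = np.triu_indices(mouth.shape[1], k=1)       # unique pairs only
    return dist[:, iu, ju]

rng = np.random.default_rng(1)
mouth = rng.random((3, 4, 2))           # 3 frames, 4 mouth landmarks
feats = mouth_distance_features(mouth)  # shape (3, 6)
```

Because only inter-landmark distances are kept, the feature is unaffected by translating the mouth region within the frame.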
“…Audio-visual speaker separation and speech enhancement methods proposed to date (e.g. [39], [40]) typically estimate a clean audio signal, rather than a mask, from audio features taken from the mixture and from visual features extracted from the speaker. In this work, we instead use the audio estimates taken from the visual features of each speaker to create either a binary mask or a ratio mask.…”
Section: Introduction
confidence: 99%