2020
DOI: 10.1109/taslp.2019.2946789

Bridging the Gap Between Monaural Speech Enhancement and Recognition With Distortion-Independent Acoustic Modeling

Abstract: Monaural speech enhancement has made dramatic advances since the introduction of deep learning a few years ago. Although enhanced speech has been demonstrated to have better intelligibility and quality for human listeners, feeding it directly to automatic speech recognition (ASR) systems trained with noisy speech has not produced expected improvements in ASR performance. The lack of an enhancement benefit on recognition, or the gap between monaural speech enhancement and recognition, is often attributed to spee…

citations
Cited by 33 publications
(8 citation statements)
references
References 38 publications
“…In deep learning-based SE, the noisy speech y is processed by a DNN to generate an estimate to the clean speech signal, ŝ. As proved in [4], [8], the SE process adds some unwanted artifacts that negatively affects the enhanced speech signal, as it causes speech distortion. Considering the effect of these artifacts, the time domain enhanced speech signal, ŝ2 , that is generated by the second stage DE-CADE network, shown in Fig.…”
Section: Problem Definition (mentioning)
confidence: 95%
“…Recent deep learning-based speech enhancement (SE) architectures have shown a great ability to generate estimated clean speech signals with high quality and intelligibility [1]- [3]. This allows these architectures to be employed for real-life SE applications, including Automatic Speech Recognition (ASR) [4], [5] and hearing aids [6], [7]. However, when applying SE architectures to these applications, other factors should be taken into consideration.…”
Section: Introduction (mentioning)
confidence: 99%
“…First, ratio masking is well justified for separation under the assumption that target speech and background interference are uncorrelated, which holds well for additive noise (including background noise and interfering speech) but not for convolutive interference as in the case of reverberation [40]. Second, speech separation algorithms commonly introduce processing artifacts into the target speech signal [12], [42]. It is likely difficult for ratio masking to suppress such processing artifacts introduced by the separation module, particularly considering that these artifacts are correlated with the target speech signal.…”
Section: B. Dereverberation Stage (mentioning)
confidence: 99%
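
The last citation statement argues that ratio masking is justified when target speech and interference are uncorrelated. A minimal sketch of that reasoning follows, under assumptions that are not from any of the cited papers: the signals are toy white-noise sequences, power is measured over the whole signal rather than per time-frequency bin, and a scaled copy of the "speech" is a crude stand-in for correlated interference such as reverberation or processing artifacts. All function names are hypothetical. When the two components are uncorrelated, mixture power is close to the sum of the component powers, so the ratio S / (S + N) gives the target's share of the mixture energy; when they are correlated, the cross term is large and that interpretation no longer holds.

```python
import random


def ideal_ratio_mask(speech_power, noise_power, eps=1e-12):
    """Return the per-bin ratio mask S / (S + N) from power values."""
    return [s / (s + n + eps) for s, n in zip(speech_power, noise_power)]


def mean_power(signal):
    """Average power (mean of squared samples) of a sequence."""
    return sum(v * v for v in signal) / len(signal)


def additivity_gap(target, interference):
    """Relative gap between mixture power and the sum of component powers.

    Close to 0 when target and interference are uncorrelated, which is the
    assumption under which a ratio mask reflects the target's energy share.
    """
    mixture = [t + i for t, i in zip(target, interference)]
    summed = mean_power(target) + mean_power(interference)
    return abs(mean_power(mixture) - summed) / summed


def make_signals(n=20000, seed=0):
    """Toy data (an assumption for illustration, not real speech)."""
    rng = random.Random(seed)
    speech = [rng.gauss(0.0, 1.0) for _ in range(n)]
    noise = [rng.gauss(0.0, 1.0) for _ in range(n)]  # independent of speech
    correlated = [0.8 * s for s in speech]           # fully correlated copy
    return speech, noise, correlated


if __name__ == "__main__":
    speech, noise, correlated = make_signals()
    print("gap, uncorrelated noise:", additivity_gap(speech, noise))
    print("gap, correlated interference:", additivity_gap(speech, correlated))
    print("mask at equal power:", ideal_ratio_mask([1.0], [1.0]))
```

In this toy setting the gap stays small for independent noise and becomes large for the correlated copy, which mirrors the statement's point that artifacts correlated with the target are hard for ratio masking to suppress.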