2013 IEEE International Conference on Acoustics, Speech and Signal Processing 2013
DOI: 10.1109/icassp.2013.6639038

Ideal ratio mask estimation using deep neural networks for robust speech recognition

Abstract: We propose a feature enhancement algorithm to improve robust automatic speech recognition (ASR). The algorithm estimates a smoothed ideal ratio mask (IRM) in the Mel frequency domain using deep neural networks and a set of time-frequency unit level features that has previously been used to estimate the ideal binary mask. The estimated IRM is used to filter out noise from a noisy Mel spectrogram before performing cepstral feature extraction for ASR. On the noisy subset of the Aurora-4 robust ASR corpus, the pro…
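The enhancement step described in the abstract (mask the noisy Mel spectrogram, then extract cepstral features) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the 26-band and 13-coefficient sizes, the energy floor, and the random stand-in data are all assumptions made for illustration, and the mask here is a random placeholder rather than a DNN estimate.

```python
import numpy as np

def apply_ratio_mask(noisy_mel, mask):
    """Scale each time-frequency unit of a Mel spectrogram by a mask in [0, 1]."""
    mask = np.clip(mask, 0.0, 1.0)  # guard against estimates outside [0, 1]
    return noisy_mel * mask

def dct_ii_matrix(n_out, n_in):
    """First n_out rows of the orthonormal DCT-II basis, shape (n_out, n_in)."""
    k = np.arange(n_out)[:, None]
    n = np.arange(n_in)[None, :]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_in))
    basis *= np.sqrt(2.0 / n_in)
    basis[0] *= np.sqrt(0.5)  # extra scaling of the DC row for orthonormality
    return basis

def cepstra_from_mel(mel_energy, n_ceps=13, floor=1e-10):
    """Log-compress Mel energies (frames x bands) and project onto the DCT-II basis."""
    log_mel = np.log(np.maximum(mel_energy, floor))
    return log_mel @ dct_ii_matrix(n_ceps, mel_energy.shape[1]).T

# Hypothetical data: 5 frames, 26 Mel bands.
rng = np.random.default_rng(0)
noisy_mel = rng.uniform(0.1, 1.0, size=(5, 26))
mask = rng.uniform(0.0, 1.0, size=(5, 26))  # placeholder for an estimated IRM

enhanced = apply_ratio_mask(noisy_mel, mask)
ceps = cepstra_from_mel(enhanced)
```

Whether the mask is applied to energies or magnitudes, and how the smoothing of the IRM is done, is not stated in the excerpt above, so the sketch leaves both choices open.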

Cited by 481 publications (270 citation statements)
References 19 publications
“…Besides getting the mapping feature directly, the DNN can also be used to train an ideal binary mask (IBM) which can be used to separate the clean speech from background noise as shown in Figure 10 [91,177,178]. With a priori knowledge of noise types and SNR, we can generate IBMs as training targets and use noisy power spectral as training data.…”
Section: Speech Recognition and Verification For The Internet Of
Citation type: mentioning
confidence: 99%
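The statement above says that IBM training targets can be generated when the noise and SNR are known in advance. A minimal sketch of that target generation follows, assuming the clean-speech and noise energies are available separately for each time-frequency unit (as they are when mixtures are created synthetically). The function name, the 0 dB local criterion `lc_db`, and the toy values are illustrative assumptions, not taken from the citing paper.

```python
import numpy as np

def ideal_binary_mask(speech_energy, noise_energy, lc_db=0.0, eps=1e-12):
    """Return 1.0 where the local SNR exceeds lc_db (in dB), else 0.0."""
    local_snr_db = 10.0 * np.log10((speech_energy + eps) / (noise_energy + eps))
    return (local_snr_db > lc_db).astype(np.float32)

# Toy energies for a 2 x 2 grid of time-frequency units (hypothetical values).
speech = np.array([[4.0, 1.0], [0.5, 9.0]])
noise = np.array([[1.0, 1.0], [2.0, 3.0]])

ibm = ideal_binary_mask(speech, noise)
```

Units where speech dominates the noise are labeled 1 and all others 0; a network would then be trained to predict these labels from features of the noisy signal.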
“…For robust ASR, the ideal ratio mask (IRM), defined as the ratio of speech energy to total energy (speech and noise) in each time-frequency unit, has been shown to have a better performance compared to using IBM in a large vocabulary speech recognition task [39]. In [40], the DNN is used to estimate the instantaneous SNR for computing IRM, subsequently applied to filter out noise from a noisy Mel spectrogram. The recurrent neural networks (RNNs), with their ability to model the temporal dependencies in speech, have also been employed to estimate the time-frequency masks from the magnitude spectrum of a noisy signal for speech enhancement and recognition [41].…”
Section: Speech Enhancement Using DNN
Citation type: mentioning
confidence: 99%
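The statement above defines the IRM as the ratio of speech energy to total energy and mentions computing it from an estimate of the instantaneous SNR. Under that definition the two views are algebraically equivalent, which the following sketch checks numerically. The function names, the `eps` guard, and the toy values are assumptions for illustration.

```python
import numpy as np

def irm_from_energy(speech_energy, noise_energy, eps=1e-12):
    """Ratio of speech energy to total (speech + noise) energy per unit."""
    return speech_energy / (speech_energy + noise_energy + eps)

def irm_from_snr_db(snr_db):
    """Same quantity expressed through the instantaneous SNR in dB."""
    snr_linear = 10.0 ** (np.asarray(snr_db) / 10.0)
    return snr_linear / (1.0 + snr_linear)

# Toy per-unit energies (hypothetical values).
speech = np.array([4.0, 1.0, 0.5])
noise = np.array([1.0, 1.0, 2.0])
snr_db = 10.0 * np.log10(speech / noise)

mask_a = irm_from_energy(speech, noise)
mask_b = irm_from_snr_db(snr_db)
```

Because the mapping from SNR to mask value is monotonic and bounded in (0, 1), a network that predicts the SNR can be converted to a mask estimate without retraining.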
“…The IRM is widely used in speech segregation, speech enhancement and noise robust ASR [13], [15], [17], [18]. The IRM is defined as follows: […] where $S^2(t, f)$ and $N^2(t, f)$ denote the speech and noise energy at a particular T-F point, respectively.…”
Section: A. IRM Features
Citation type: mentioning
confidence: 99%
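The equation itself did not survive in the excerpt above. A commonly used form, consistent with the verbal definition quoted earlier (speech energy over total energy) and with the symbols $S^2(t, f)$ and $N^2(t, f)$, is sketched below. This is an assumption about what the citing paper wrote; some variants additionally raise the ratio to an exponent such as $1/2$.

```latex
\mathrm{IRM}(t, f) = \frac{S^2(t, f)}{S^2(t, f) + N^2(t, f)}
```

With this form the mask lies between 0 and 1, approaching 1 where speech dominates the time-frequency unit and 0 where noise dominates.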