2006
DOI: 10.1016/j.specom.2006.09.003

Binary and ratio time-frequency masks for robust speech recognition

Cited by 214 publications (123 citation statements)
References 23 publications
“…The speech enhancement may be formulated as a binary classification problem to estimate the ideal binary mask (IBM), which is used to attenuate the energy within the noise dominant time-frequency units. For robust ASR, the ideal ratio mask (IRM), defined as the ratio of speech energy to total energy (speech and noise) in each time-frequency unit, has been shown to have a better performance compared to using IBM in a large vocabulary speech recognition task [39]. In [40], the DNN is used to estimate the instantaneous SNR for computing IRM, subsequently applied to filter out noise from a noisy Mel spectrogram.…”
Section: Speech Enhancement Using DNN
Type: mentioning
confidence: 99%
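The two masks described in this statement can be illustrated with a minimal sketch. It assumes the clean-speech and noise energies are known separately for each time-frequency unit (which is what makes the masks "ideal"), and it assumes a 0 dB local SNR threshold for deciding that a unit is noise dominant. The function names, the `lc_db` parameter, and the toy energy values are hypothetical and are not taken from the cited papers.

```python
import numpy as np

def ideal_binary_mask(speech_energy, noise_energy, lc_db=0.0):
    """Return 1.0 where the local SNR exceeds lc_db (assumed threshold), else 0.0."""
    eps = np.finfo(float).eps  # guards against log(0) and division by zero
    snr_db = 10.0 * np.log10((speech_energy + eps) / (noise_energy + eps))
    return (snr_db > lc_db).astype(float)

def ideal_ratio_mask(speech_energy, noise_energy):
    """Return speech energy divided by total (speech + noise) energy per unit."""
    eps = np.finfo(float).eps
    return speech_energy / (speech_energy + noise_energy + eps)

# Toy 2x2 energy grids (rows: time frames, columns: frequency bands); made-up values.
speech = np.array([[4.0, 1.0], [0.0, 9.0]])
noise = np.array([[1.0, 4.0], [2.0, 1.0]])

ibm = ideal_binary_mask(speech, noise)
irm = ideal_ratio_mask(speech, noise)
```

Multiplying a noisy energy spectrogram elementwise by `ibm` zeroes the noise-dominant units, whereas `irm` scales every unit by its estimated speech fraction.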
“…Numerous algorithms have been proposed for developing the values of M[n, k] based on the inputs (e.g. [6,7,8,9,11,12,13]) and other variations are possible in which M[n, k] is a continuous function of the inputs rather than binary. In the algorithms considered, the mask M[n, k] is typically based on the cell-by-cell comparisons of the left and right input signals; however, T-F masking is also widely applied to mono audio to improve signal quality for ASR [14,15,16] and for human intelligibility [17,18].…”
Section: Time-frequency Masking
Type: mentioning
confidence: 99%
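The masking operation referred to in this statement is commonly written as an elementwise product. The form below is a sketch under assumptions: n is taken to index time frames and k frequency bins, and X[n, k] and Y[n, k] are hypothetical symbols for the input and masked short-time spectra that do not appear in the quoted text.

```latex
Y[n,k] = M[n,k]\,X[n,k], \qquad
M[n,k] \in \{0,1\} \ \text{(binary mask)} \quad \text{or} \quad
M[n,k] \in [0,1] \ \text{(continuous mask)}
```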
“…Results of previous studies using these techniques (e.g. [6,7,8,9,10,11,12]) suggest the following observations (among others): While T-F masking techniques are typically well motivated, there has been little formal mathematical analysis of them, with performance typically expressed in terms of secondary statistics such as the accuracy of automatic speech recognition (ASR) systems. While it is true that algorithms developed to improve ASR recognition accuracy must be evaluated in terms of ASR performance, we also believe that further mathematical analysis and comparison to linear beamforming is potentially beneficial, as speech recognition experiments tend to…”
Section: Introduction
Type: mentioning
confidence: 99%
“…A main drawback of this system is that it performs ASR in the spectral domain, which is known to be suboptimal, especially when the vocabulary size is large [21]. Further, since the T-F fragments are formed prior to the ASR decoding stage, the top-down models do not influence T-F unit level decisions.…”
Section: Prior Work
Type: mentioning
confidence: 99%