Binary and ratio time-frequency masks for robust speech recognition

Srinivasan, S.; Roman, Nicoleta; Wang, DeLiang L.

doi:10.1016/j.specom.2006.09.003

Cited by 214 publications

(123 citation statements)

References 23 publications

Citation types: Supporting, Mentioning (121), Contrasting, Unclassified

“…The speech enhancement may be formulated as a binary classification problem to estimate the ideal binary mask (IBM), which is used to attenuate the energy within the noise dominant time-frequency units. For robust ASR, the ideal ratio mask (IRM), defined as the ratio of speech energy to total energy (speech and noise) in each time-frequency unit, has been shown to have a better performance compared to using IBM in a large vocabulary speech recognition task [39]. In [40], the DNN is used to estimate the instantaneous SNR for computing IRM, subsequently applied to filter out noise from a noisy Mel spectrogram.…”

Section: Speech Enhancement Using DNN (mentioning)

confidence: 99%

Feature mapping using far-field microphones for distant speech recognition

Himawan

Motlíček

Sridharan

2016

Speech Communication

Acoustic modeling based on deep architectures has recently gained remarkable success, with substantial improvement of speech recognition accuracy in several automatic speech recognition (ASR) tasks. For distant speech recognition, the multi-channel deep neural network based approaches rely on the powerful modeling capability of deep neural network (DNN) to learn suitable representation of distant speech directly from its multi-channel source. In this model-based combination of multiple microphones, features from each channel are concatenated and used together as an input to DNN. This allows integrating the multi-channel audio for acoustic modeling without any pre-processing steps. Despite powerful modeling capabilities of DNN, an environmental mismatch due to noise and reverberation may result in severe performance degradation when features are simply fed to a DNN without a feature enhancement step. In this paper, we introduce the nonlinear bottleneck feature mapping approach using DNN, to transform the noisy and reverberant features to its clean version. The bottleneck features trained on clean signal are used as a teacher signal because they contain relevant information to phoneme classification, and the mapping is performed with the objective of suppressing noise and reverberation. The individual and combined impacts of beamforming and speaker adaptation techniques along with the feature mapping are examined for distant large vocabulary speech recognition, using a single and multiple far-field microphones. As an alternative to beamforming, experiments with concatenating multiple channel features are conducted. The experimental results on the AMI meeting corpus show that the feature mapping, used in combination with beamforming and speaker adaptation yields a distant speech recognition performance below 50% word error rate (WER), using DNN for acoustic modeling.
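
The citation statement above defines the ideal ratio mask (IRM) as the ratio of speech energy to total energy in each time-frequency unit, and describes the ideal binary mask (IBM) as marking noise-dominant units for attenuation. A minimal sketch of both masks follows. The function name `ideal_masks`, the 0 dB local SNR threshold `lc_db`, and the small constant `eps` are illustrative assumptions, not taken from the cited papers.

```python
import numpy as np

def ideal_masks(speech_energy, noise_energy, lc_db=0.0, eps=1e-12):
    """Return (ibm, irm) for per-unit energies of shape (frames, bins).

    Assumption: a unit counts as speech dominant when its local SNR
    exceeds lc_db (0 dB here); the threshold is a hypothetical choice.
    """
    speech_energy = np.asarray(speech_energy, dtype=float)
    noise_energy = np.asarray(noise_energy, dtype=float)

    # Local SNR in dB for every time-frequency unit.
    snr_db = 10.0 * np.log10((speech_energy + eps) / (noise_energy + eps))

    # IBM: 1 for speech-dominant units, 0 for noise-dominant units.
    ibm = (snr_db > lc_db).astype(float)

    # IRM: speech energy divided by total (speech + noise) energy.
    irm = speech_energy / (speech_energy + noise_energy + eps)
    return ibm, irm

# Toy energies: 2 frames x 2 frequency bins.
speech = np.array([[4.0, 1.0], [0.0, 9.0]])
noise = np.array([[1.0, 4.0], [2.0, 1.0]])
ibm, irm = ideal_masks(speech, noise)
```

Both masks need the premixed speech and noise energies, so they serve as training targets or oracle references; at test time they have to be estimated from the noisy signal.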

“…Numerous algorithms have been proposed for developing the values of M[n, k] based on the inputs (e.g. [6,7,8,9,11,12,13]) and other variations are possible in which M[n, k] is a continuous function of the inputs rather than binary. In the algorithms considered, the mask M[n, k] is typically based on the cell-by-cell comparisons of the left and right input signals; however, T-F masking is also widely applied to mono audio to improve signal quality for ASR [14,15,16] and for human intelligibility [17,18].…”

Section: Time-frequency Masking (mentioning)

confidence: 99%

“…Results of previous studies using these techniques (e.g. [6,7,8,9,10,11,12]) suggest the following observations (among others): While T-F masking techniques are typically well motivated, there has been little formal mathematical analysis of them, with performance typically expressed in terms of secondary statistics such as the accuracy of automatic speech recognition (ASR) systems. While it is true that algorithms developed to improve ASR recognition accuracy must be evaluated in terms of ASR performance, we also believe that further mathematical analysis and comparison to linear beamforming is potentially beneficial, as speech recognition experiments tend to…”

Footnote in the citing paper: This work has been supported by the National Science Foundation (Grant IIS-I0916918) and the Cisco Corporation (Grant 570877).

Section: Introduction (mentioning)

confidence: 99%

An analysis of binaural spectro-temporal masking as nonlinear beamforming

Moghimi

Stern

2014

2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Array-based time-frequency masking algorithms are an important type of nonlinear array processing. In this paper we develop a model that characterizes the directional sensitivity of these algorithms in a fashion similar to the beam patterns commonly used to characterize linear array processing. Two alternative formulations are described, and it is shown that one of these formulations predicts signal distortion and processing gain in time-frequency masking accurately, as well as speech recognition accuracy afforded by time-frequency masking in the presence of additive interfering sources.
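
The citation statements above describe a mask M[n, k] built from cell-by-cell comparisons of the left and right input signals, which may be binary or a continuous function of the inputs. The sketch below is one hypothetical instance of that idea, not the method of the cited paper: it assumes the comparison is the interaural phase difference in each cell, and the names `binaural_masks`, `apply_mask`, and the threshold `max_phase_diff` are illustrative.

```python
import numpy as np

def binaural_masks(left_stft, right_stft, max_phase_diff=0.3):
    """Return (binary, continuous) masks M[n, k] from two-channel STFTs.

    Assumption: cells whose left/right phase difference is small are
    treated as coming from the target direction (straight ahead).
    """
    left = np.asarray(left_stft, dtype=complex)
    right = np.asarray(right_stft, dtype=complex)

    # Phase difference per cell, wrapped to (-pi, pi].
    ipd = np.angle(left * np.conj(right))

    # Binary mask: keep a cell only if the channels are nearly in phase.
    binary = (np.abs(ipd) <= max_phase_diff).astype(float)

    # Continuous mask (assumed raised-cosine shape): 1 at zero phase
    # difference, falling smoothly to 0 at a difference of pi.
    continuous = 0.5 * (1.0 + np.cos(ipd))
    return binary, continuous

def apply_mask(mask, stft):
    """Scale each time-frequency cell of an STFT by its mask value."""
    return mask * np.asarray(stft, dtype=complex)

# One frame, two bins: in phase in bin 0, 90 degrees apart in bin 1.
left = np.array([[1 + 0j, 1j]])
right = np.array([[1 + 0j, 1 + 0j]])
binary, continuous = binaural_masks(left, right)
masked = apply_mask(binary, left)
```

The continuous variant attenuates off-target cells gradually instead of zeroing them, which is the distinction the quoted statement draws between binary and continuous M[n, k].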

“…A main drawback of this system is that it performs ASR in the spectral domain which is known to be suboptimal, especially when the vocabulary size is large [21]. Further, since the T-F fragments are formed prior to the ASR decoding stage, the top-down models do not influence T-F unit level decisions.…”

Section: Prior Work (mentioning)

confidence: 99%

Coupling binary masking and robust ASR

Narayanan

Wang

2013

2013 IEEE International Conference on Acoustics, Speech and Signal Processing

Self Cite

We present a novel framework for performing speech separation and robust automatic speech recognition (ASR) in a unified fashion. Separation is performed by estimating the ideal binary mask (IBM), which identifies speech dominant and noise dominant units in a time-frequency (T-F) representation of the noisy signal. ASR is performed on extracted cepstral features after binary masking. Previous systems perform these steps in a sequential fashion: separation followed by recognition. The proposed framework, which we call bidirectional speech decoding (BSD), unifies these two stages. It does this by using multiple IBM estimators, each of which is designed specifically for a back-end acoustic phonetic unit (BPU) of the recognizer. The standard ASR decoder is modified to use these IBM estimators to obtain BPU-specific cepstra during likelihood calculation. On the Aurora-4 robust ASR task, the proposed framework obtains a relative improvement of 17% in word error rate over the noisy baseline. It also obtains significant improvements in the quality of the estimated IBM.
