On the Ideal Ratio Mask as the Goal of Computational Auditory Scene Analysis

Hummersone, Christopher; Stokes, Toby; Brookes, Tim

doi:10.1007/978-3-642-55016-4_12

Cited by 73 publications

(42 citation statements)

References 44 publications


“…The IRM was employed as the training target for supervised speech segregation (Srinivasan et al, 2006; Narayanan and Wang, 2013; Hummersone et al, 2014; Wang et al, 2014). The IRM is defined as…”

Section: IRM Estimation Using DNN (mentioning)

confidence: 99%

“…An ideal T-F mask indicates whether, or to what extent, each T-F unit is dominated by target speech. A binary decision leads to the ideal binary mask (IBM; Hu and Wang, 2001; Wang, 2005), whereas a ratio decision leads to the ideal ratio mask (IRM; Srinivasan et al, 2006; Narayanan and Wang, 2013; Hummersone et al, 2014; Wang et al, 2014). Unlike traditional speech enhancement, supervised segregation does not make explicit statistical assumptions about the underlying speech or noise signal, but rather learns data distributions from a training set.…”

Section: Introduction (mentioning)

confidence: 99%


Large-scale training to increase speech intelligibility for hearing-impaired listeners in novel noises

Chen, Wang, Yoho, et al. 2016

The Journal of the Acoustical Society of America



Supervised speech segregation has been recently shown to improve human speech intelligibility in noise, when trained and tested on similar noises. However, a major challenge involves the ability to generalize to entirely novel noises. Such generalization would enable hearing aid and cochlear implant users to improve speech intelligibility in unknown noisy environments. This challenge is addressed in the current study through large-scale training. Specifically, a deep neural network (DNN) was trained on 10 000 noises to estimate the ideal ratio mask, and then employed to separate sentences from completely new noises (cafeteria and babble) at several signal-to-noise ratios (SNRs). Although the DNN was trained at the fixed SNR of −2 dB, testing using hearing-impaired listeners demonstrated that speech intelligibility increased substantially following speech segregation using the novel noises and unmatched SNR conditions of 0 dB and 5 dB. Sentence intelligibility benefit was also observed for normal-hearing listeners in most noisy conditions. The results indicate that DNN-based supervised speech segregation with large-scale training is a very promising approach for generalization to new acoustic environments.
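The citing statements above contrast a binary decision per T-F unit (IBM) with a ratio decision (IRM), the target this study's DNN estimates. The following is a minimal sketch of how the two targets could be computed when the premixed speech and noise energies are known. The function names, the 0 dB local criterion, and the square-root compression (`beta=0.5`) are assumptions chosen for illustration, not details taken from the cited papers.

```python
import numpy as np

EPS = np.finfo(float).eps  # guards against division by zero and log(0)

def ideal_binary_mask(speech_energy, noise_energy, lc_db=0.0):
    """Return 1 where the local SNR exceeds lc_db (assumed criterion), else 0."""
    local_snr_db = 10.0 * np.log10((speech_energy + EPS) / (noise_energy + EPS))
    return (local_snr_db > lc_db).astype(float)

def ideal_ratio_mask(speech_energy, noise_energy, beta=0.5):
    """Return a soft gain in [0, 1]: the speech share of the energy, compressed by beta."""
    return (speech_energy / (speech_energy + noise_energy + EPS)) ** beta

# Toy energies for a 2 x 3 grid of T-F units (made-up values).
speech = np.array([[4.0, 1.0, 0.0],
                   [9.0, 1.0, 1.0]])
noise = np.array([[1.0, 3.0, 2.0],
                  [1.0, 4.0, 0.0]])

ibm = ideal_binary_mask(speech, noise)  # hard keep/discard decision per unit
irm = ideal_ratio_mask(speech, noise)   # graded attenuation per unit
```

In this toy grid the unit with speech energy 1 and noise energy 3 is discarded entirely by the binary mask, while the ratio mask keeps it at half gain.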


“…Given a mixture in STFT domain where the signal in each TF bin either belongs solely to the desired or the undesired signal, extraction can be performed using binary masks [16] (e.g., [6], [8]). Given a mixture in STFT domain where several sources are active in the same TF bin, ratio masks (RMs) [17] or complex ratio masks (CRMs) [18] can be applied. Both assign a gain to each mixture TF bin to estimate the desired spectrum.…”

Section: Introduction (mentioning)

confidence: 99%
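To make the distinction in the statement above concrete, here is a small sketch that applies a binary mask, a real-valued ratio mask, and a complex ratio mask to a toy mixture STFT. The random spectra and the particular mask definitions (magnitude comparison, magnitude ratio, and `S / Y`) are assumptions for illustration; the cited references [16] to [18] may define them differently.

```python
import numpy as np

rng = np.random.default_rng(0)
shape = (4, 5)  # toy STFT size: (frequency bins, time frames)

# Hypothetical desired and undesired complex spectra, and their mixture.
S = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
N = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
Y = S + N

eps = 1e-12  # avoids division by zero

# Binary mask: keep a bin only if the desired signal dominates it.
binary_mask = (np.abs(S) > np.abs(N)).astype(float)

# Real-valued ratio mask: a gain in [0, 1]; the mixture phase is left unchanged.
ratio_mask = np.abs(S) / (np.abs(S) + np.abs(N) + eps)

# Complex ratio mask: a complex gain that can also correct the phase.
complex_ratio_mask = S / (Y + eps)

# Each mask assigns one gain per mixture TF bin (element-wise product).
S_bm = binary_mask * Y
S_rm = ratio_mask * Y
S_crm = complex_ratio_mask * Y
```

With oracle masks like these, only the complex ratio mask recovers `S` up to numerical precision; the real-valued masks can scale each bin but cannot change its phase.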

Deep Filtering: Signal Extraction and Reconstruction Using Complex Time-Frequency Filters

Mack, Habets 2020

IEEE Signal Process. Lett.


Signal extraction from a single-channel mixture with additional undesired signals is most commonly performed using time-frequency (TF) masks. Typically, the mask is estimated with a deep neural network (DNN), and element-wise applied to the complex mixture short-time Fourier transform (STFT) representation to perform the extraction. Ideal mask magnitudes are zero for solely undesired signals in a TF bin and undefined for total destructive interference. Usually, masks have an upper bound to provide well-defined DNN outputs at the cost of limited extraction capabilities. We propose to estimate with a DNN a complex TF filter for each mixture TF bin which maps an STFT area in the respective mixture to the desired TF bin to address destructive interference in mixture TF bins. The DNN is optimized by minimizing the error between the extracted and the ground-truth desired signal allowing to learn the TF filters without having to specify ground-truth TF filters. We compare our approach with complex and real-valued TF masks by separating speech from a variety of different sound and noise classes from the Google AudioSet corpus. We also process the mixture STFT with notch-filters and zero whole time-frames, to simulate packet-loss during transmission, to demonstrate the reconstruction capabilities of our approach. The proposed method outperformed the baselines, especially when notch-filters and time-frame zeroing were applied.
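The idea of a complex TF filter that maps an STFT area of the mixture to one desired TF bin can be sketched as a weighted sum over neighbouring bins. This is only an illustration under stated assumptions: the function name `apply_deep_filter`, the neighbourhood half-widths `L` (frames) and `I` (frequency bins), the zero padding at the edges, and the plain (unconjugated) weighted sum are hypothetical choices, and the filter here is set by hand rather than estimated by a DNN.

```python
import numpy as np

def apply_deep_filter(Y, H, L=1, I=1):
    """Filter a mixture STFT with one complex filter per TF bin.

    Y: complex array, shape (frames, bins).
    H: complex array, shape (frames, bins, 2*L + 1, 2*I + 1); H[n, k] holds
       the weights for the neighbourhood centred on bin (n, k).
    """
    n_frames, n_bins = Y.shape
    Y_pad = np.pad(Y, ((L, L), (I, I)))  # zero padding at the borders
    out = np.zeros_like(Y)
    for dl in range(2 * L + 1):          # frame offset dl - L
        for di in range(2 * I + 1):      # frequency offset di - I
            out += H[:, :, dl, di] * Y_pad[dl:dl + n_frames, di:di + n_bins]
    return out

# Toy mixture with frame 2 zeroed, loosely imitating a lost packet.
Y = (np.arange(12, dtype=float).reshape(4, 3) + 1j).astype(complex)
Y[2, :] = 0.0

# Hand-set filter: pass every bin through, except in frame 2 copy the previous frame.
H = np.zeros((4, 3, 3, 3), dtype=complex)
H[:, :, 1, 1] = 1.0   # centre tap only: equivalent to an all-ones mask
H[2, :, 1, 1] = 0.0
H[2, :, 0, 1] = 1.0   # tap on the previous frame, same frequency

Y_hat = apply_deep_filter(Y, H)
```

A filter with only its centre tap set reduces to an ordinary complex mask. The extra taps are what let a zeroed bin be filled from its neighbours, which a per-bin gain cannot do.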


“…To solve this problem, the Ideal Ratio Mask (IRM) was proposed in [197] with the aim of smoothing the T-F units instead of removing them. The IRM provides better performance because it is closely related to the Wiener filter [123], where a high SNR value indicates low attenuation of the energy of the T-F units, while a low SNR value indicates high attenuation, smoothing all T-F units instead of removing them as in the case of the IBM.…”

Section: Ideal Ratio Mask (IRM) (unclassified)
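The relation to the Wiener filter described in the statement above can be illustrated by writing the gain as a function of local SNR. This is a sketch under the assumption of the textbook Wiener-type gain SNR / (1 + SNR) with SNR as a linear power ratio; the function names and the 0 dB threshold used for the binary comparison are illustrative and not taken from the cited thesis.

```python
import numpy as np

def wiener_like_gain(snr_db):
    """Soft gain SNR / (1 + SNR): close to 1 at high SNR, close to 0 at low SNR."""
    snr = 10.0 ** (np.asarray(snr_db, dtype=float) / 10.0)
    return snr / (1.0 + snr)

def binary_gain(snr_db, lc_db=0.0):
    """Hard decision: keep the unit (1) above lc_db, remove it (0) otherwise."""
    return (np.asarray(snr_db, dtype=float) > lc_db).astype(float)

snr_db = np.array([-20.0, -5.0, 0.0, 5.0, 20.0])
soft = wiener_like_gain(snr_db)  # graded attenuation, never exactly zero here
hard = binary_gain(snr_db)       # units at or below 0 dB are removed entirely
```

At 0 dB the soft gain is 0.5, and it rises monotonically with SNR, so low-SNR units are strongly attenuated but not deleted.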

“…Using the acoustic representations or feature set, the next step is to identify the units that contain dominant information relative to the noise, in order to group them and label them as reliable units belonging to the same sound. This procedure can be carried out with masks based on local SNR estimation [128] [197], masks based on Bayesian classification of the spectrum [116][222], among others. In this chapter we continue with masks based on local SNR estimation, as in Chapters 4 and 5.…”

Section: Chapter 6 Speech Segregation Using the INM Mask Based … (unclassified)

Realce E Reconhecimento De Voz Contínua Em Ambientes Adversos

Gordillo

