Many studies on deep learning-based speech enhancement (SE) that follow the computational auditory scene analysis approach employ the ideal binary mask or the ideal ratio mask to reconstruct the enhanced speech signal. However, many real-world SE applications demand a balance between denoising capability and computational cost. In this study, first, an improvement over the ideal ratio mask is proposed by introducing an efficient adaptive correlation-based factor for adjusting the ratio mask. The proposed method exploits the correlation coefficients among the noisy speech, noise and clean speech to effectively redistribute the power ratio of speech and noise during the ratio mask construction phase. Second, to make the supervised SE system more computationally efficient, quantization techniques are considered to reduce the number of bits needed to represent floating-point numbers, leading to a more compact SE model. The proposed quantized correlation mask is used in conjunction with a 4-layer deep neural network (DNN-QCM) comprising dropout regularization, pre-training and noise-aware training to derive a robust, high-order mapping for enhancement and to improve generalization to unseen conditions. Results show that the quantized correlation mask outperforms the conventional ratio mask representation and the other SE algorithms used for comparison. Compared with a DNN using the ideal ratio mask as its learning target, the DNN-QCM provides an improvement of approximately 6.5% in the short-time objective intelligibility score and 11.0% in the perceptual evaluation of speech quality score. Quantization reduces the neural network weights from a 32-bit to a 5-bit representation while the system continues to suppress stationary and non-stationary noise effectively. Timing analyses also show that the compactness techniques incorporated in the proposed DNN-QCM system reduce the training and inference time by 15.7% and 10.5%, respectively.
INDEX TERMS: Correlation coefficients, deep neural network, dynamic noise-aware training, quantization, speech enhancement, training targets.
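As an illustration of the two steps described in this abstract (correlation-based adjustment of the ratio mask, and low-bit weight quantization), the Python/NumPy sketch below shows one plausible form. The abstract does not give the exact adjustment rule, so the re-weighting in `correlation_adjusted_mask` and all parameter choices are assumptions, not the paper's formulation.

```python
import numpy as np

def ideal_ratio_mask(speech_pow, noise_pow, beta=0.5):
    """Conventional IRM from per-bin speech and noise power spectra."""
    return (speech_pow / (speech_pow + noise_pow + 1e-12)) ** beta

def correlation_adjusted_mask(noisy, clean, noise, speech_pow, noise_pow, beta=0.5):
    """Hypothetical correlation-based re-weighting of the IRM: the power
    ratio is biased toward the component (speech or noise) that correlates
    more strongly with the noisy mixture. Not the paper's exact rule."""
    rho_s = abs(np.corrcoef(noisy.ravel(), clean.ravel())[0, 1])
    rho_n = abs(np.corrcoef(noisy.ravel(), noise.ravel())[0, 1])
    return (rho_s * speech_pow /
            (rho_s * speech_pow + rho_n * noise_pow + 1e-12)) ** beta

def quantize_weights(w, n_bits=5):
    """Uniform quantization of a weight array to 2**n_bits levels, as used
    to shrink the DNN from 32-bit floats to a 5-bit representation."""
    levels = 2 ** n_bits - 1
    w_min, w_max = float(w.min()), float(w.max())
    step = max((w_max - w_min) / levels, 1e-12)  # guard against constant w
    return np.round((w - w_min) / step) * step + w_min
```

With `n_bits=5`, each weight is mapped onto one of 32 uniformly spaced levels, which is the source of the reported reduction from 32-bit to 5-bit storage.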
A feature extraction method based on wavelet Mel-Frequency Cepstral Coefficients (MFCCs) is proposed for acoustic noise classification. The method, combined with a wavelet sub-band selection technique and a feedforward neural network with two hidden layers, is a promising solution for a compact acoustic noise classification system that could be added to speech enhancement systems and deployed in hearing devices such as cochlear implants. The technique leads to higher classification accuracies (with a mean of 95.25%) across three SNR values, a significantly smaller feature set with 16 features, a reduced memory requirement, faster training convergence and a computation cost lower by a factor of 0.69 in comparison to the traditional Short-Time Fourier Transform-based (STFT-based) technique.
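A minimal sketch of how wavelet-based MFCC-style features might be computed, assuming PyWavelets is available: a depth-4 wavelet-packet tree yields 2^4 = 16 sub-bands, matching the 16-feature set reported, and the log sub-band energies are decorrelated with a DCT. The sub-band layout and the wavelet choice ('db4') are assumptions, not the paper's exact method.

```python
import numpy as np
import pywt                      # PyWavelets, assumed available
from scipy.fftpack import dct

def wavelet_mfcc(frame, wavelet='db4', level=4):
    """Wavelet-packet analogue of MFCCs: 2**level sub-band log energies,
    decorrelated with a DCT to give cepstral-style features."""
    wp = pywt.WaveletPacket(data=frame, wavelet=wavelet, maxlevel=level)
    bands = [node.data for node in wp.get_level(level, order='freq')]
    log_e = np.log(np.array([np.sum(b ** 2) for b in bands]) + 1e-12)
    return dct(log_e, norm='ortho')  # 16 features for level=4
```

For example, calling `wavelet_mfcc` on a 512-sample speech frame returns a 16-dimensional feature vector that could feed the two-hidden-layer feedforward network described above.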
Two auditory-inspired feature-extraction models, the Multi-Resolution CochleaGram (MRCG) and the Auditory Image Model (AIM), are compared on their acoustic noise classification performance when combined with two supervised machine-learning algorithms: an ensemble of bagged decision trees and a Support Vector Machine (SVM). Noise classification accuracies are then assessed in nine different sound environments, with or without added speech, and at different SNRs. The results demonstrate that classification scores using feature extraction with the MRCG model are significantly higher than when using the AIM model (p < 0.05), irrespective of the machine-learning classifier. Using the SVM as a classifier also resulted in significantly better (p < 0.05) classification performance over bagged trees, irrespective of the feature-extraction model. Overall, the MRCG model combined with the SVM provides more accurate classification for most of the sound stimuli tested. Based on this comparison, suggestions are offered on how auditory-model-plus-machine-learning systems can be improved for sound classification.
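The classifier comparison lends itself to a short scikit-learn sketch. Here `X` would hold MRCG (or AIM) feature vectors and `y` the noise-class labels; the RBF kernel, C value and cross-validation setup are illustrative assumptions rather than the study's protocol.

```python
from sklearn.svm import SVC
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def compare_classifiers(X, y, folds=5):
    """Return mean cross-validated accuracy for the two classifiers
    compared in the study: an SVM and an ensemble of bagged trees."""
    svm = SVC(kernel='rbf', C=1.0)                        # SVM branch
    bagged = BaggingClassifier(DecisionTreeClassifier())  # bagged-trees branch
    return (cross_val_score(svm, X, y, cv=folds).mean(),
            cross_val_score(bagged, X, y, cv=folds).mean())
```

Running this per sound environment and per SNR would reproduce the kind of accuracy grid the study uses to compare the MRCG and AIM front ends.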
Speech enhancement (SE) is used in many applications, such as hearing devices, to improve speech intelligibility and quality. Convolutional neural network-based (CNN-based) SE algorithms in the literature often employ generic convolutional filters that are not optimized for SE applications. This paper presents a CNN-based SE algorithm with an adaptive filter design (named 'CNN-AFD') using Gabor functions and region-aware convolution. The proposed algorithm incorporates fixed Gabor functions into convolutional filters to model human auditory processing for improved denoising performance. The feature maps obtained from the Gabor-incorporated convolutional layers serve as learnable guided masks (tuned at backpropagation) for generating adaptive custom region-aware filters. The custom filters extract features from speech regions (i.e., 'region-aware') while maintaining translation invariance. To reduce the high inference cost of the CNN, skip convolution and activation-analysis-wise pruning are explored. Employing skip convolution allowed the training time per epoch to be reduced by close to 40%. Pruning of neurons with high numbers of zero activations complements skip convolution and reduces the number of model parameters by more than 30%. The proposed CNN-AFD outperformed all four CNN-based SE baseline algorithms (i.e., a CNN-based SE employing generic filters, a CNN-based SE without region-aware convolution, a CNN-based SE trained with complex spectrograms and a CNN-based SE processing in the time domain) with an average of 0.95, 1.82 and 0.82 in short-time objective intelligibility (STOI), perceptual evaluation of speech quality (PESQ) and logarithmic spectral distance (LSD) scores, respectively, when tasked to denoise speech contaminated with NOISEX-92 noises at -5, 0 and 5 dB signal-to-noise ratios (SNRs).
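A minimal PyTorch sketch of the fixed-Gabor-filter idea, assuming 2-D convolution over spectrogram inputs: the filters of the first convolutional layer are set to sampled Gabor functions and frozen so backpropagation does not update them. The orientations, wavelength and kernel size below are illustrative assumptions; the region-aware convolution, skip convolution and pruning stages are not reproduced here.

```python
import math
import torch
import torch.nn as nn

def gabor_kernel(size, theta, lam, sigma=2.0, gamma=0.5, psi=0.0):
    """Sample a 2-D Gabor function on a size x size grid."""
    half = size // 2
    y, x = torch.meshgrid(
        torch.arange(-half, half + 1, dtype=torch.float32),
        torch.arange(-half, half + 1, dtype=torch.float32),
        indexing='ij')
    xr = x * math.cos(theta) + y * math.sin(theta)   # rotated coordinates
    yr = -x * math.sin(theta) + y * math.cos(theta)
    return (torch.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            * torch.cos(2 * math.pi * xr / lam + psi))

def fixed_gabor_conv(out_channels=8, size=7):
    """Conv layer whose filters are fixed (non-trainable) Gabor functions
    at evenly spaced orientations; parameters are illustrative."""
    conv = nn.Conv2d(1, out_channels, size, padding=size // 2, bias=False)
    with torch.no_grad():
        for i in range(out_channels):
            theta = i * math.pi / out_channels       # evenly spaced angles
            conv.weight[i, 0] = gabor_kernel(size, theta, lam=4.0)
    conv.weight.requires_grad_(False)                # frozen, as in the paper
    return conv
```

Freezing the first layer this way leaves the subsequent layers (and the guided masks derived from these feature maps) as the only trainable components, which is consistent with the abstract's description of fixed Gabor functions feeding learnable region-aware filters.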