Speech enhancement (SE) is an important method for improving speech quality and intelligibility in noisy environments where the received speech is severely distorted by noise. An efficient speech enhancement system relies on accurately modelling the long-term dependencies of noisy speech. Deep learning has benefited greatly from transformers, in which multi-head attention (MHA) models long-term dependencies efficiently by exploiting sequence similarity. Transformers frequently outperform recurrent neural network (RNN) and convolutional neural network (CNN) models on many tasks while supporting parallel processing. In this paper we propose a two-stage convolutional transformer for speech enhancement in the time domain. The transformer exploits global information as well as parallel computation, which helps suppress long-term noise. Unlike the two-stage transformer neural network (TSTNN), the proposed work uses different transformer structures for the intra- and inter-transformers to extract both local and global features of noisy speech. Moreover, a CNN module is added to the transformer so that short-term noise can be reduced more effectively, exploiting the ability of CNNs to extract local information. The experimental findings demonstrate that the proposed model outperforms existing models in terms of short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ).
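As an illustration of the idea, the following is a minimal sketch (not the authors' code) of a transformer block augmented with a convolutional module: multi-head attention captures the global, long-term context, while a depthwise 1-D convolution supplies the local, short-term structure. The layer width, number of heads, and kernel size are illustrative assumptions.

import torch
import torch.nn as nn

class ConvTransformerBlock(nn.Module):
    def __init__(self, dim=64, num_heads=4, kernel_size=5):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        # Depthwise convolution plays the role of the CNN module for local features.
        self.conv = nn.Sequential(
            nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim),
            nn.GELU(),
        )
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, x):                    # x: (batch, frames, dim)
        a, _ = self.attn(x, x, x)            # global context via multi-head attention
        x = self.norm1(x + a)
        c = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local context via CNN
        x = self.norm2(x + c)
        return self.norm3(x + self.ffn(x))

x = torch.randn(2, 100, 64)                  # 2 utterances, 100 frames, 64 features
print(ConvTransformerBlock()(x).shape)       # torch.Size([2, 100, 64])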
Speech enhancement (SE) is an important method for improving speech quality and intelligibility in noisy environments. An effective speech enhancement model depends on precise modelling of the long-range dependencies of noisy speech. Several recent studies have examined ways to enhance speech by capturing long-term contextual information. The time-frequency (T-F) distribution of speech spectral components is also important for speech enhancement, but is usually ignored in these studies. Multi-stage learning is an effective way to integrate several deep-learning modules at the same time, with the benefit that the optimization target can be updated iteratively, stage by stage. In this paper, speech enhancement is investigated through multi-stage learning using a structure in which time-frequency attention (TFA) blocks are followed by stacks of squeezed temporal convolutional networks (S-TCN) with exponentially increasing dilation rates. To reinject original information into later stages, a feature fusion (FF) block is inserted at the input of each later stage to reduce the possibility of speech information being lost in the early stages. The S-TCN blocks are responsible for temporal sequence modelling. The TFA block is a simple but effective network module that explicitly exploits position information to generate a 2D attention map characterising the salient T-F distribution of speech, using two parallel branches: time-frame attention and frequency attention. A set of utterances from the LibriSpeech and VoiceBank databases is used to evaluate the performance of the proposed SE model. Extensive experiments demonstrate that the proposed model consistently improves over existing baselines on two widely used objective metrics, PESQ and STOI. Compared with noisy speech, the average PESQ and STOI of the proposed model improve by 41.7% and 5.4% on the LibriSpeech dataset, and by 36.10% and 3.1% on the VoiceBank dataset. Additionally, we explore the generalization of the proposed TFA-S-TCN model across different speech datasets through cross-database analysis. Our evaluation results also show that the TFA module yields a significant improvement in robustness to noise.
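The following is a minimal sketch (not the authors' implementation) of the time-frequency attention idea: the feature map is pooled along frequency in one branch and along time in the other, each branch yields a 1-D attention vector, and their outer product forms the 2-D T-F attention map that rescales the input. The channel count and spectrogram shape are illustrative assumptions.

import torch
import torch.nn as nn

class TimeFrequencyAttention(nn.Module):
    def __init__(self, channels=16):
        super().__init__()
        self.time_branch = nn.Sequential(nn.Conv1d(channels, channels, 1), nn.Sigmoid())
        self.freq_branch = nn.Sequential(nn.Conv1d(channels, channels, 1), nn.Sigmoid())

    def forward(self, x):                        # x: (batch, channels, freq, time)
        t_att = self.time_branch(x.mean(dim=2))  # pool over frequency -> (B, C, T)
        f_att = self.freq_branch(x.mean(dim=3))  # pool over time      -> (B, C, F)
        att_2d = f_att.unsqueeze(-1) * t_att.unsqueeze(-2)  # 2-D T-F map (B, C, F, T)
        return x * att_2d                        # rescale the input feature map

x = torch.randn(2, 16, 257, 100)                 # batch, channels, freq bins, frames
print(TimeFrequencyAttention()(x).shape)         # torch.Size([2, 16, 257, 100])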
Over the past decade, deep learning has enabled significant advances in the enhancement of noisy speech. In end-to-end speech enhancement, deep neural networks transform a noisy speech signal into a clean speech signal directly in the time domain, without any domain conversion or mask estimation. Recently, U-Net-based models have achieved good enhancement performance. Nevertheless, those built on ordinary convolutions may neglect contextual information and detailed features of the input speech. To address these issues, recent studies have improved model performance by adding network modules such as attention mechanisms and long short-term memory (LSTM). In this work, we propose a new U-Net-based speech enhancement model using a lightweight and efficient shuffle attention (SA) mechanism, a gated recurrent unit (GRU), and residual blocks with dilated convolutions, where each residual block is followed by a multi-scale convolution block (MSCB). The proposed hybrid structure enables temporal context aggregation in the time domain. The advantage of the shuffle attention mechanism is that channel and spatial attention are applied simultaneously to each sub-feature, suppressing potential noise while highlighting the relevant semantic feature regions by aggregating similar features from all locations. The MSCB is employed to extract rich temporal features. To model the correlation between neighboring noisy speech frames, a two-layer GRU is added at the bottleneck of the U-Net. The experimental findings demonstrate that the proposed model outperforms existing models in terms of short-time objective intelligibility (STOI) and perceptual evaluation of speech quality (PESQ).
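The following is a minimal sketch (not the authors' code) of a multi-scale convolution block of the kind described: parallel 1-D convolutions with different kernel sizes extract temporal features at several scales, and a 1x1 convolution fuses them before a residual connection. The kernel sizes and channel count are illustrative assumptions.

import torch
import torch.nn as nn

class MultiScaleConvBlock(nn.Module):
    def __init__(self, channels=32, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # One branch per kernel size; odd kernels with k//2 padding preserve length.
        self.branches = nn.ModuleList(
            [nn.Conv1d(channels, channels, k, padding=k // 2) for k in kernel_sizes]
        )
        # 1x1 convolution fuses the concatenated multi-scale features.
        self.fuse = nn.Conv1d(channels * len(kernel_sizes), channels, 1)
        self.act = nn.PReLU()

    def forward(self, x):                        # x: (batch, channels, time steps)
        multi = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.act(self.fuse(multi)) + x    # residual connection

x = torch.randn(2, 32, 1000)                     # 2 feature maps of 1000 time steps
print(MultiScaleConvBlock()(x).shape)            # torch.Size([2, 32, 1000])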