This paper addresses speech dereverberation using a single microphone. Motivated by the recent success of fully convolutional networks (FCNs) in many image processing applications, we investigate their applicability to enhancing the speech signal represented as short-time Fourier transform (STFT) images. We present two variations: a "U-Net", an encoder-decoder network with skip connections, and a generative adversarial network (GAN) with the U-Net as generator, which yields a more intuitive cost function for training. To evaluate our method we used data from the REVERB challenge and compared our results to other methods under the same conditions. We found that our method outperforms the competing methods in most cases.
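For concreteness, below is a minimal sketch of an encoder-decoder with skip connections operating on STFT magnitude images, assuming a PyTorch implementation. The channel counts, depth, and kernel sizes are illustrative placeholders, not the paper's exact configuration.

```python
# Minimal U-Net-style sketch for enhancing STFT magnitude "images".
# Assumes input of shape (batch, 1, freq, time) with freq and time
# divisible by 4 so the skip connections align after pooling/upsampling.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

class UNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc1 = conv_block(1, 16)
        self.enc2 = conv_block(16, 32)
        self.bottleneck = conv_block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up2 = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.dec2 = conv_block(64, 32)   # 32 (skip) + 32 (upsampled)
        self.up1 = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        self.dec1 = conv_block(32, 16)
        self.out = nn.Conv2d(16, 1, kernel_size=1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection
        return self.out(d1)               # enhanced STFT magnitude
```

In the GAN variation, a network of this form would serve as the generator, with a separate discriminator network supplying the adversarial training signal on top of a reconstruction loss.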
In this paper we present a single-microphone speech enhancement algorithm. We propose a hybrid approach that merges the generative Mixture of Gaussians (MoG) model with a discriminative neural network (NN). The proposed algorithm is executed in two phases: a training phase, which does not recur, and a test phase. First, the noise-free speech power spectral density (PSD) is modeled as a MoG, representing the phoneme-based diversity of the speech signal. An NN is then trained on a phoneme-labeled database for phoneme classification, with mel-frequency cepstral coefficients (MFCCs) as the input features. Given the phoneme classification results, a speech presence probability (SPP) is obtained using both the generative and discriminative models. Soft spectral subtraction is then carried out while the noise estimate is simultaneously updated. The discriminative NN maintains the continuity of the speech, and the generative phoneme-based MoG preserves the speech spectral structure. An extensive experimental study using real speech and noise signals is provided. We also compare the proposed algorithm with alternative speech enhancement algorithms and show a significant improvement over previous methods in terms of both speech quality measures and speech recognition results.
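As a rough illustration of the test phase, the sketch below applies SPP-gated soft spectral subtraction while recursively updating the noise PSD estimate. The function name, the smoothing constant `alpha`, and the spectral floor `beta` are illustrative assumptions, not values taken from the paper.

```python
# Hedged sketch of soft spectral subtraction driven by a per-bin speech
# presence probability (SPP), with a recursive noise PSD update.
import numpy as np

def soft_spectral_subtraction(noisy_psd, spp, noise_psd, alpha=0.9, beta=0.1):
    """noisy_psd, spp: (freq, time) arrays; noise_psd: (freq,) running estimate."""
    enhanced = np.empty_like(noisy_psd)
    for t in range(noisy_psd.shape[1]):
        frame = noisy_psd[:, t]
        p = spp[:, t]
        # Update the noise PSD in proportion to the speech *absence* probability.
        noise_psd = alpha * noise_psd + (1 - alpha) * ((1 - p) * frame + p * noise_psd)
        # Soft subtraction: subtract the estimated noise where speech is likely
        # present, and fall back to a small spectral floor where it is not.
        clean_est = np.maximum(frame - noise_psd, 0.0)
        enhanced[:, t] = p * clean_est + (1 - p) * beta * frame
    return enhanced, noise_psd
```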
We present a unified network for voice separation of an unknown number of speakers. The proposed approach is composed of several separation heads optimized together with a speaker classification branch. Separation is carried out in the time domain, with parameters shared across all separation heads. The classification branch estimates the number of speakers, while each head specializes in separating a particular number of speakers. We evaluate the proposed model under both clean and noisy reverberant settings. Results suggest that the proposed approach outperforms the baseline model by a significant margin. Additionally, we present a new noisy and reverberant dataset with up to five speakers speaking simultaneously.
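A hedged sketch of such a multi-head design is given below, assuming PyTorch: a shared time-domain encoder feeds both a speaker-count classifier and several separation heads, where head k emits k output channels. The encoder/decoder layers and all sizes are illustrative placeholders, not the paper's architecture.

```python
# Minimal sketch of a multi-head time-domain separator with a
# speaker-count classification branch and a shared encoder.
import torch
import torch.nn as nn

class MultiHeadSeparator(nn.Module):
    def __init__(self, max_speakers=5, hidden=128):
        super().__init__()
        # Shared time-domain encoder (parameters shared across heads).
        self.encoder = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=16, stride=8, padding=4),
            nn.ReLU(),
        )
        # One separation head per possible speaker count (2..max_speakers);
        # head k decodes k waveform channels, one per estimated source.
        self.heads = nn.ModuleList(
            nn.ConvTranspose1d(hidden, k, kernel_size=16, stride=8, padding=4)
            for k in range(2, max_speakers + 1)
        )
        # Classification branch estimating the number of speakers.
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(hidden, max_speakers - 1),
        )

    def forward(self, mix):                    # mix: (batch, 1, samples)
        feats = self.encoder(mix)
        count_logits = self.classifier(feats)  # which head to trust
        separations = [head(feats) for head in self.heads]
        return count_logits, separations

# At inference, the classifier's argmax selects the matching head, e.g.:
#   k = count_logits.argmax(dim=-1).item(); est_sources = separations[k]
```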