In this paper, we demonstrate the significance of restoring harmonics of the fundamental frequency (pitch) in deep neural network (DNN) based speech enhancement. We propose a sliding-window attention network to regress the spectral magnitude mask (SMM) from the noisy speech signal. Even though the network parameters can be estimated by minimizing the mask loss, it does not restore the pitch harmonics, especially at higher frequencies. In this paper, we propose to restore the pitch harmonics in the spectral domain by minimizing cepstral loss around the pitch peak. The network parameters are estimated using a combination of the mask loss and cepstral loss. The proposed network architecture functions like an adaptive comb filter on voiced segments, and emphasizes the pitch harmonics in the speech spectrum. The proposed approach achieves comparable performance with the state-of-the-art methods with much lesser computational complexity 1 . Index Terms-Spectral magnitude mask, transformer, pitch harmonics, cepstral pitch peak and production-related loss.
End-to-end text-to-speech synthesis systems achieved immense success in recent times, with improved naturalness and intelligibility. However, the end-toend models, which primarily depend on the attention-based alignment, do not offer an explicit provision to modify/incorporate the desired prosody while synthesizing the signal. Moreover, the state-of-the-art end-to-end systems use autoregressive models for synthesis, making the prediction sequential. Hence, the inference time and the computational complexity are quite high. This paper proposes Prosody-TTS, an end-to-end speech synthesis model that combines the advantages of statistical parametric models and end-to-end neural network models. It also has a provision to modify or incorporate the desired prosody by controlling the fundamental frequency (f 0 ) and the phone duration. Generating speech samples with appropriate prosody and rhythm helps in improving the naturalness of the synthesized speech. We explicitly model the duration of the phoneme and the f 0 to have control over them during the synthesis. The model is trained in an end-toend fashion to directly generate the speech waveform from the input text, which in turn depends on the auxiliary subtasks of predicting the phoneme duration, f 0 , and mel spectrogram. Experiments on the Telugu language data of the IndicTTS database show that the proposed Prosody-TTS model achieves state-of-the-art performance with a mean opinion score of 4.08, with a very low inference time.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.