This paper describes a novel loss function for training feedforward neural networks (FFNNs) that can generate smooth speech parameter sequences without post-processing. In statistical parametric speech synthesis based on deep neural networks (DNNs), maximum likelihood parameter generation (MLPG) or recurrent neural networks (RNNs) are generally used to generate smooth speech parameter sequences. However, because MLPG requires utterance-level processing, it is not suitable for speech synthesis requiring low latency. Furthermore, networks such as long short-term memory RNNs (LSTM-RNNs) have high computational costs. As RNNs are not recommended when computational resources are limited, we look at employing FFNNs as an alternative. One limitation of FFNNs is that they are trained without regard to the relationships between speech parameters in adjacent frames. To overcome this limitation and generate smooth speech parameter sequences from FFNNs alone, we propose a novel loss function that uses long- and short-term features of the speech parameters. We evaluated the proposed loss function with a focus on the fundamental frequency (F0) and found that, with the proposed loss function, an FFNN-only approach can generate F0 contours that are perceptually equal to or better, in terms of naturalness, than those generated with MLPG or by LSTM-RNNs.
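The abstract does not spell out the exact form of the loss, so the following is only a minimal sketch of the idea, assuming a PyTorch setup: a frame-level squared error is combined with errors on short-term differences and longer-term averages of the parameter trajectory. The window width, the weights, and the choice of a moving average as the long-term feature are illustrative assumptions, not the paper's definition.

```python
import torch
import torch.nn.functional as F

def delta(x: torch.Tensor) -> torch.Tensor:
    """First-order difference along the time axis (a short-term feature)."""
    return x[:, 1:, :] - x[:, :-1, :]

def segment_mean(x: torch.Tensor, width: int = 5) -> torch.Tensor:
    """Moving average over `width` frames (a longer-term feature)."""
    # x: (batch, frames, dims) -> pool over time, then restore the layout
    return F.avg_pool1d(x.transpose(1, 2), kernel_size=width, stride=1).transpose(1, 2)

def long_short_term_loss(pred, target, w_delta=1.0, w_segment=1.0):
    """Sum of frame-level, short-term, and long-term squared errors (illustrative)."""
    loss = F.mse_loss(pred, target)                                   # static, frame-level error
    loss = loss + w_delta * F.mse_loss(delta(pred), delta(target))    # short-term dynamics
    loss = loss + w_segment * F.mse_loss(segment_mean(pred),
                                         segment_mean(target))        # long-term trend
    return loss
```

Because every term is a differentiable function of the frame-wise outputs, the FFNN still runs strictly frame by frame at synthesis time; the sequence context enters only through training.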
Recently, text-to-speech (TTS) systems combined with neural vocoders have been shown to generate high-fidelity speech. However, collecting the required training data and building these advanced systems from scratch are time- and resource-consuming. A more economical approach is to develop a neural vocoder to enhance the speech generated by existing TTS systems. Nonetheless, this approach usually suffers from two issues: 1) temporal mismatches between TTS and natural waveforms and 2) acoustic mismatches between training and testing data. To address these issues, we adopt a cyclic voice conversion (VC) model to generate temporally matched pseudo-VC data for training and acoustically matched enhanced data for testing the neural vocoders. Because of its generality, this framework can be applied to arbitrary neural vocoders. In this paper, we apply the proposed method with a state-of-the-art WaveNet vocoder to two different TTS systems, and both objective and subjective experimental results confirm the effectiveness of the proposed framework.
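The abstract describes a data flow rather than an implementation, so the sketch below is only a schematic of that flow; every function name is a hypothetical placeholder (the paper's actual VC model, vocoder, and feature interfaces are not specified in the abstract).

```python
from typing import Iterable, List, Tuple

# Hypothetical placeholders for the components named in the abstract.
def cyclic_vc_pseudo(natural_feats):  # natural features -> temporally matched pseudo-VC features
    ...

def vc_enhance(tts_feats):            # TTS features -> acoustically matched enhanced features
    ...

def vocoder_generate(vocoder, feats): # neural vocoder (e.g., WaveNet) waveform generation
    ...

def build_training_pairs(natural_corpus: Iterable[Tuple]) -> List[Tuple]:
    """Training: pseudo-VC features stay time-aligned with the natural
    waveforms, so (features, waveform) pairs can be formed directly."""
    return [(cyclic_vc_pseudo(feats), wav) for feats, wav in natural_corpus]

def synthesize(vocoder, tts_feats):
    """Testing: TTS features are first enhanced so that they acoustically
    match the data the vocoder was trained on."""
    return vocoder_generate(vocoder, vc_enhance(tts_feats))
```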
Deep neural network (DNN)-based speech synthesis has become popular in recent years and is expected to soon be widely used in embedded devices and environments with limited computing resources. In such environments, the key goal is to reduce the computational cost of generating speech parameter sequences while maintaining voice quality. However, reducing computational cost is challenging for the two primary conventional DNN-based methods used to model speech parameter sequences. In feed-forward neural networks (FFNNs) with maximum likelihood parameter generation (MLPG), MLPG reconstructs the temporal structure of the speech parameter sequences that FFNNs ignore, but it incurs additional computational cost that grows with the sequence length. In recurrent neural networks, the recursive structure allows speech parameter sequences to be generated with their temporal structure taken into account without MLPG, but at a higher computational cost than FFNNs. We propose a new approach in which DNNs capture the temporal structure by backpropagating, through the loss function, the errors of multiple attributes of the temporal sequence. This method enables FFNNs to generate speech parameter sequences that respect their temporal structure without MLPG. We generated fundamental frequency and mel-cepstrum sequences with the proposed and conventional methods, then synthesized speech from these sequences and evaluated it subjectively. The proposed method enables even FFNNs, which operate on a frame-by-frame basis, to generate speech parameter sequences that account for the temporal structure and that are perceptually superior to those from the conventional methods.
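Reading this abstract together with the first one, one plausible formalization of the loss (an assumption made here for illustration, not the paper's stated equation) is a weighted sum of squared errors over several transforms of the output sequence, each capturing one temporal attribute:

\[
\mathcal{L}(\hat{\mathbf{y}}, \mathbf{y}) \;=\; \sum_{k} w_k \,\bigl\lVert T_k(\hat{\mathbf{y}}) - T_k(\mathbf{y}) \bigr\rVert_2^2,
\]

where \(\mathbf{y}\) and \(\hat{\mathbf{y}}\) are the target and generated parameter sequences (e.g., F0 or mel-cepstrum trajectories), each \(T_k\) extracts one attribute of the temporal sequence (static values, short-term differences, longer-term averages, and so on), and \(w_k\) weights its contribution. Gradients of every term flow back to the frame-wise FFNN outputs, which is how the network can acquire the temporal structure without MLPG.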