Investigating Very Deep Highway Networks for Parametric Speech Synthesis

Wang, Xin; Takaki, Shinji; Yamagishi, Junichi

doi:10.21437/ssw.2016-27

Cited by 15 publications

(13 citation statements)

References 14 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Recently, researchers have used NNs as alternatives to HMMs to jointly model the F 0 and other spectral features [22], [23], [24]. Some researchers use NNs exclusively for F 0 modeling [25], [12], which may be reasonable because it was recently found that NNs may prioritize the spectral features over the F 0 [26]. In fact, many NN-based F 0 models have been proposed before the advent of SPSS-based TTS systems [27], [28], [29].…”

Section: A Classical Modelsmentioning

confidence: 99%

A Vector Quantized Variational Autoencoder (VQ-VAE) Autoregressive Neural $F_0$ Model for Statistical Parametric Speech Synthesis

Wang

Takaki

Yamagishi

et al. 2020

IEEE/ACM Trans. Audio Speech Lang. Process.

Self Cite

View full text Add to dashboard Cite

Recurrent neural networks (RNNs) can predict fundamental frequency (F0) for statistical parametric speech synthesis systems, given linguistic features as input. However, these models assume conditional independence between consecutive F0 values, given the RNN state. In a previous study, we proposed autoregressive (AR) neural F0 models to capture the causal dependency of successive F0 values. In subjective evaluations, a deep AR model (DAR) outperformed an RNN. Here, we propose a Vector Quantized Variational Autoencoder (VQ-VAE) neural F0 model that is both more efficient and more interpretable than the DAR. This model has two stages: one uses the VQ-VAE framework to learn a latent code for the F0 contour of each linguistic unit, and other learns to map from linguistic features to latent codes. In contrast to the DAR and RNN, which process the input linguistic features frame-by-frame, the new model converts one linguistic feature vector into one latent code for each linguistic unit. The new model achieves better objective scores than the DAR, has a smaller memory footprint and is computationally faster. Visualization of the latent codes for phones and moras reveals that each latent code represents an F0 shape for a linguistic unit.

show abstract

Section: A Classical Modelsmentioning

confidence: 99%

A Vector Quantized Variational Autoencoder (VQ-VAE) Autoregressive Neural $F_0$ Model for Statistical Parametric Speech Synthesis

Wang

Takaki

Yamagishi

et al. 2020

IEEE/ACM Trans. Audio Speech Lang. Process.

Self Cite

View full text Add to dashboard Cite

show abstract

“…Highway networks [8], [9] are weighted skip-connections between layers, and they often connect hidden layers. Given that the input and output are often in the same domain (e.g., cepstrum) in VC, we propose a VC using highway networks connected from the input to output as follows:…”

Section: Vc Using Input-to-output Highway Networkmentioning

confidence: 99%

Voice Conversion Using Input-to-Output Highway Networks

Saito

Takamichi

Saruwatari

2017

IEICE Trans. Inf. & Syst.

View full text Add to dashboard Cite

SUMMARYThis paper proposes Deep Neural Network (DNN)-based Voice Conversion (VC) using input-to-output highway networks. VC is a speech synthesis technique that converts input features into output speech parameters, and DNN-based acoustic models for VC are used to estimate the output speech parameters from the input speech parameters. Given that the input and output are often in the same domain (e.g., cepstrum) in VC, this paper proposes a VC using highway networks connected from the input to output. The acoustic models predict the weighted spectral differentials between the input and output spectral parameters. The architecture not only alleviates over-smoothing effects that degrade speech quality, but also effectively represents the characteristics of spectral parameters. The experimental results demonstrate that the proposed architecture outperforms Feed-Forward neural networks in terms of the speech quality and speaker individuality of the converted speech. key words: statistical parametric speech synthesis, DNN-based voice conversion, highway networks, over-smoothing

show abstract

“…It is well known that the generated sequences of parameters from the HMMs are temporally smoothed, producing perceptual differences between synthetic and natural speech. There have been several attempts to improve the quality of synthesized speech, based on Deep Learning approaches: The first main approach is to substitute the HMM for deep neural networks (DNN) [7] [8] [9] [10], learning the map between linguistic specification directly to speech parameters. The second approach is to apply post-filters for the parameters generated by the HMMs [11] [12] [13].…”

Section: Introductionmentioning

confidence: 99%

Improving Post-Filtering of Artificial Speech Using Pre-Trained LSTM Neural Networks

Coto-Jiménez

2019

Preprint

View full text Add to dashboard Cite

Several researchers have contemplated deep learning-based post-filters to increase the quality of statistical parametric speech synthesis, which perform a mapping of the synthetic speech to the natural speech, considering the different parameters separately and trying to reduce the gap between them. The Long Short-term Memory (LSTM) Neural Networks have been applied successfully in this purpose, but there are still many aspects to improve in the results and in the process itself. In this paper, we introduce a new pre-training approach for the LSTM, with the objective of enhancing the quality of the synthesized speech, particularly in the spectrum, in a more efficient manner. Our approach begins with an auto-associative training of one LSTM network, which is used as an initialization for the post-filters. We show the advantages of this initialization for the enhancing of the Mel-Frequency Cepstral parameters of synthetic speech. Results show that the initialization succeeds in achieving better results in enhancing the statistical parametric speech spectrum in most cases when compared to the common random initialization approach of the networks.

show abstract

Investigating Very Deep Highway Networks for Parametric Speech Synthesis

Cited by 15 publications

References 14 publications

A Vector Quantized Variational Autoencoder (VQ-VAE) Autoregressive Neural $F_0$ Model for Statistical Parametric Speech Synthesis

A Vector Quantized Variational Autoencoder (VQ-VAE) Autoregressive Neural $F_0$ Model for Statistical Parametric Speech Synthesis

Voice Conversion Using Input-to-Output Highway Networks

Improving Post-Filtering of Artificial Speech Using Pre-Trained LSTM Neural Networks

Contact Info

Product

Resources

About