A Vector Quantized Variational Autoencoder (VQ-VAE) Autoregressive Neural $F_0$ Model for Statistical Parametric Speech Synthesis

Wang, Xin; Takaki, Shinji; Yamagishi, Junichi; King, Simon; Tokuda, Keiichi

doi:10.1109/taslp.2019.2950099

Cited by 33 publications

(26 citation statements)

References 27 publications

“…Most prosody representations are learnt at the sentence-level [16,17]. However, these are too coarse and are not able to perfectly reconstruct prosody [18]. To accurately capture prosody we need a sequence of representations, e.g.…”

Section: Introduction (mentioning)

confidence: 99%

“…To accurately capture prosody we need a sequence of representations, e.g. word or phrase level [18,19]. Following CopyCat [20], we learn a sequence of representations from the mel-spectrogram, but we do so at the word-level (stage-1).…”

Section: Introduction (mentioning)

confidence: 99%

“…Reducing the bit rate of the representation, by using a longer unit length, should allow the learnt prior to capture longer-range effects, such as prosodic patterns. The linguistic linker introduced by Wang et al. [18] experiments with different unit lengths when modelling $F_0$, but not other aspects of prosody such as rhythm and intensity. Representations extracted from a spectrogram can capture these aspects of prosody [21,17].…”

Section: Introduction (mentioning)

confidence: 99%

Camp: A Two-Stage Approach to Modelling Prosody in Context

Hodari, Moinet, Karlapati, et al. 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Prosody is an integral part of communication, but remains an open problem in state-of-the-art speech synthesis. There are two major issues faced when modelling prosody: (1) prosody varies at a slower rate compared with other content in the acoustic signal (e.g. segmental information and background noise); (2) determining appropriate prosody without sufficient context is an ill-posed problem. In this paper, we propose solutions to both these issues. To mitigate the challenge of modelling a slow-varying signal, we learn to disentangle prosodic information using a word level representation. To alleviate the ill-posed nature of prosody modelling, we use syntactic and semantic information derived from text to learn a context-dependent prior over our prosodic space. Our context-aware model of prosody (CAMP) outperforms the state-of-the-art technique, closing the gap with natural speech by 26%. We also find that replacing attention with a jointly-trained duration model improves prosody significantly.

“…VQ-VAE [14] has been applied to various speech synthesis tasks, including diverse and controllable TTS [17,18], a new TTS framework based on symbol-to-symbol translation [19], speech coding [20], voice conversion [21], and representation learning [22,23,24,25].…”

Section: Vector Quantized Autoencoder For Speech Tasks (mentioning)

confidence: 99%

End-to-End Text-to-Speech Using Latent Duration Based on VQ-VAE

Yasuda, Wang, Yamagishi 2021

ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

Explicit duration modeling is a key to achieving robust and efficient alignment in text-to-speech synthesis (TTS). We propose a new TTS framework using explicit duration modeling that incorporates duration as a discrete latent variable to TTS and enables joint optimization of whole modules from scratch. We formulate our method based on conditional VQ-VAE to handle discrete duration in a variational autoencoder and provide a theoretical explanation to justify our method. In our framework, a connectionist temporal classification (CTC)-based force aligner acts as the approximate posterior, and text-to-duration works as the prior in the variational autoencoder. We evaluated our proposed method with a listening test and compared it with other TTS methods based on soft-attention or explicit duration modeling. The results showed that our systems rated between soft-attention-based methods (Transformer-TTS, Tacotron2) and explicit duration modeling-based methods (Fastspeech).

“…The VAEs have met with great success in recent years in several applicative areas including anomaly detection [6][7][8][9], text classification [10], sentence generation [11], speech synthesis and recognition [12][13][14], spatio-temporal solar irradiance forecasting [15] and in geoscience for data assimilation [2]. In other respects, the two major application areas of the VAEs are the biomedical and healthcare recommendation [16][17][18][19], and industrial applications for nonlinear processes monitoring [1,3,4,20-25].…”

Section: Introduction (mentioning)

confidence: 99%

Semi-Supervised Adversarial Variational Autoencoder

Zemouri 2020

MAKE

We present a method to improve the reconstruction and generation performance of a variational autoencoder (VAE) by injecting adversarial learning. Instead of comparing the reconstructed data with the original data to calculate the reconstruction loss, we use a consistency principle for deep features. The main contributions are threefold. Firstly, our approach perfectly combines the two models, i.e., GAN and VAE, and thus improves the generation and reconstruction performance of the VAE. Secondly, the VAE training is done in two steps, which allows us to dissociate the constraints used for the construction of the latent space on the one hand, and those used for the training of the decoder on the other. By using this two-step learning process, our method can be more widely used in applications other than image processing. While training the encoder, the label information is integrated to better structure the latent space in a supervised way. The third contribution is to use the trained encoder for the consistency principle for deep features extracted from the hidden layers. We present experimental results to show that our method gives better performance than the original VAE. The results demonstrate that the adversarial constraints allow the decoder to generate images that are more authentic and realistic than the conventional VAE.
