2020
DOI: 10.1109/taslp.2019.2950099
A Vector Quantized Variational Autoencoder (VQ-VAE) Autoregressive Neural $F_0$ Model for Statistical Parametric Speech Synthesis

Abstract: Recurrent neural networks (RNNs) can predict fundamental frequency (F0) for statistical parametric speech synthesis systems, given linguistic features as input. However, these models assume conditional independence between consecutive F0 values, given the RNN state. In a previous study, we proposed autoregressive (AR) neural F0 models to capture the causal dependency of successive F0 values. In subjective evaluations, a deep AR model (DAR) outperformed an RNN. Here, we propose a Vector Quantized Variational Autoencoder (VQ-VAE) …
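The abstract contrasts a conventional RNN F0 model, which treats consecutive F0 values as conditionally independent given the RNN state, with an autoregressive model that feeds the previous F0 value back as input. The sketch below is only an illustration of that feedback loop, not the authors' implementation; the layer sizes and names (LING_DIM, HIDDEN, ARNeuralF0) are assumptions.

```python
# Minimal sketch (not the paper's DAR/VQ-VAE model) of an autoregressive neural
# F0 predictor: the previous F0 value is concatenated to the linguistic features
# at each frame, so successive F0 values are no longer conditionally independent
# given the RNN state.
import torch
import torch.nn as nn

LING_DIM, HIDDEN = 384, 256  # assumed linguistic-feature and hidden sizes


class ARNeuralF0(nn.Module):
    def __init__(self):
        super().__init__()
        # input = linguistic features + previous F0 value (the AR feedback)
        self.rnn = nn.GRU(LING_DIM + 1, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, 1)

    def forward(self, ling):                      # ling: (B, T, LING_DIM)
        B, T, _ = ling.shape
        h = None
        prev_f0 = ling.new_zeros(B, 1, 1)         # start value for F0 feedback
        outputs = []
        for t in range(T):                        # generate frame by frame
            x = torch.cat([ling[:, t:t + 1], prev_f0], dim=-1)
            y, h = self.rnn(x, h)
            prev_f0 = self.out(y)                 # next F0 depends on previous F0
            outputs.append(prev_f0)
        return torch.cat(outputs, dim=1)          # (B, T, 1) predicted F0 track


f0 = ARNeuralF0()(torch.randn(2, 100, LING_DIM))  # e.g. 100 frames
```

At training time such a model is typically teacher-forced with the ground-truth previous F0; at synthesis time its own prediction is fed back, which is the causal dependency between successive F0 values that the abstract refers to.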

Cited by 33 publications (26 citation statements)
References 27 publications
“…Most prosody representations are learnt at the sentence-level [16,17]. However, these are too coarse and are not able to perfectly reconstruct prosody [18]. To accurately capture prosody we need a sequence of representations, e.g.…”
Section: Introduction (mentioning)
confidence: 99%
“…To accurately capture prosody we need a sequence of representations, e.g. word or phrase level [18,19]. Following CopyCat [20], we learn a sequence of representations from the mel-spectrogram, but we do so at the word-level (stage-1).…”
Section: Introduction (mentioning)
confidence: 99%
“…VQ-VAE [14] has been applied to various speech synthesis tasks, including diverse and controllable TTS [17,18], a new TTS framework based on symbol-to-symbol translation [19], speech coding [20], voice conversion [21], and representation learning [22,23,24,25].…”
Section: Vector Quantized Autoencoder for Speech Tasks (mentioning)
confidence: 99%
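The statement above lists speech tasks that share the same vector-quantisation bottleneck. As background, that step can be sketched as follows; the function name, codebook size, and dimensionality are illustrative assumptions, and the codebook/commitment losses of the full VQ-VAE objective are omitted.

```python
# Minimal sketch of the VQ-VAE quantisation step: each encoder output vector is
# snapped to its nearest codebook entry, and gradients are copied straight
# through the non-differentiable lookup.
import torch


def vector_quantize(z_e, codebook):
    """z_e: (B, T, D) encoder outputs; codebook: (K, D) learned code vectors."""
    # Euclidean distance from every encoder frame to every code vector
    dist = torch.cdist(z_e, codebook.unsqueeze(0).expand(z_e.size(0), -1, -1))
    idx = dist.argmin(dim=-1)                      # (B, T) discrete code indices
    z_q = codebook[idx]                            # (B, T, D) quantised latents
    # straight-through estimator: forward uses z_q, backward passes gradients
    # from z_q to z_e unchanged (codebook and commitment losses omitted here)
    z_q = z_e + (z_q - z_e).detach()
    return z_q, idx


codebook = torch.randn(256, 64, requires_grad=True)   # K=256 codes, D=64
z_q, idx = vector_quantize(torch.randn(4, 50, 64), codebook)
```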
“…The VAEs have met with great success in recent years in several applicative areas including anomaly detection [6][7][8][9], text classification [10], sentence generation [11], speech synthesis and recognition [12][13][14], spatio-temporal solar irradiance forecasting [15] and in geoscience for data assimilation [2]. In other respects, the two major application areas of the VAEs are the biomedical and healthcare recommendation [16][17][18][19], and industrial applications for nonlinear processes monitoring [1,3,4,[20][21][22][23][24][25].…”
Section: Introduction (mentioning)
confidence: 99%