2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp.2017.7953087
An autoregressive recurrent mixture density network for parametric speech synthesis

Abstract: Neural-network-based generative models, such as mixture density networks, are potential solutions for speech synthesis. In this paper we follow this path and propose a recurrent mixture density network that incorporates a trainable autoregressive model. An advantage of incorporating an autoregressive model is that the time dependency within acoustic feature trajectories can be modeled without using conventional dynamic features. More interestingly, experiments show that this autoregressive model learns to be a…
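The abstract's core idea, conditioning a mixture density output layer on the previous acoustic frame through a trainable autoregressive term, can be sketched in a few lines of numpy. This is a minimal illustration only: the dimensions, parameter names (`W_pi`, `W_mu`, `A`, `W_sig`), and use of one AR matrix per mixture component are assumptions, not details taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: hidden size, acoustic feature dim, mixture components
H, D, K = 8, 4, 2

# Hypothetical output-layer parameters
W_pi  = rng.normal(size=(K, H))           # mixture weight logits from hidden state
W_mu  = rng.normal(size=(K, D, H))        # component means from hidden state
A     = 0.1 * rng.normal(size=(K, D, D))  # trainable AR matrices (the key addition)
W_sig = rng.normal(size=(K, D, H))        # log standard deviations

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def ar_mdn_step(h, y_prev):
    """One output step: mixture parameters conditioned on the recurrent
    hidden state h AND the previous acoustic frame y_prev (the AR part)."""
    pi = softmax(W_pi @ h)        # (K,)   mixture weights
    mu = W_mu @ h + A @ y_prev    # (K, D) means shifted by the AR term
    sigma = np.exp(W_sig @ h)     # (K, D) positive std-devs
    return pi, mu, sigma

h = rng.normal(size=H)       # stand-in for an RNN hidden state
y_prev = rng.normal(size=D)  # previous acoustic frame
pi, mu, sigma = ar_mdn_step(h, y_prev)
print(pi.shape, mu.shape, sigma.shape)
```

Because the means depend on `y_prev`, frame-to-frame dependency is captured directly by the model rather than through appended delta (dynamic) features.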

Cited by 43 publications (32 citation statements) · References 18 publications
“…Nevertheless, the ideas presented here still apply while the acoustic features are used to control generative neural models. The shortcomings in the acoustic model can be addressed in the future with the use of more advanced models, such as mixture density network LSTMs [8].…”
Section: Discussion
confidence: 99%
“…In conjunction, the two major issues in SPSS, over-smoothness of the generated acoustic parameters and "buzzy" synthetic sound quality, have been attributed to the acoustic model and vocoder, respectively. Recent efforts have improved acoustic model performance, resulting in more natural synthetic parameter trajectories, using, for example, autoregressive mixture density networks [8] or generative adversarial network-based post-filtering [9]. However, the performance of these systems is still upper-bounded by the analysis-synthesis quality of the vocoder.…”
Section: Introduction
confidence: 99%
“…For the acoustic models that predict acoustic features from the linguistic features, we used shallow and deep neural AR models [9], [5] to generate the MGCs and F0, respectively. The recipes for training these acoustic models were the same as those in another of our previous studies [54].…”
Section: B. Model Configurations
confidence: 99%
“…The phase-based weighting matrix is introduced to reconstruct the glottal waveform by weighting the CW component, as shown in equation (4). Equation (3) shows that the weighting matrix function F(•) is a complicated non-linear function of the phase vector Φ(n). Thus, we use two fully connected layers followed by different non-linear activations to simulate the phase-based weighting function F(•).…”
Section: Hybrid Neural Network
confidence: 99%
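The quoted passage describes approximating the phase-based weighting function F(•) with two fully connected layers. A rough sketch of that shape is below; the layer width, the tanh/sigmoid activation choices, and all weights here are assumptions for illustration, not the cited paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 16  # samples per frame (illustrative)

# Hypothetical weights of two fully connected layers approximating F(.)
W1, b1 = rng.normal(size=(32, N)), np.zeros(32)
W2, b2 = rng.normal(size=(N, 32)), np.zeros(N)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def weighting(phi):
    """Map the phase vector phi to a per-sample weight vector:
    two FC layers with different non-linearities (assumed tanh, sigmoid)."""
    return sigmoid(W2 @ np.tanh(W1 @ phi + b1) + b2)

def reconstruct(cw, phi):
    """Weight the characteristic-waveform (CW) component elementwise
    to obtain the glottal waveform, per the quoted description."""
    return weighting(phi) * cw

phi = rng.uniform(-np.pi, np.pi, size=N)  # phase vector
cw = rng.normal(size=N)                   # CW component
glottal = reconstruct(cw, phi)
```

The sigmoid on the output keeps every weight in (0, 1), which is one plausible way to realize a bounded weighting matrix; the cited work may make a different choice.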
“…The quality of an SPSS system is mainly affected by three factors: the vocoder, acoustic model accuracy, and over-smoothing [1]. Recently, deep neural networks, especially sequential neural networks [2,3], have improved model accuracy and alleviated the over-smoothing issue. Despite these improvements, synthetic speech quality is still limited by the vocoder, which causes the gap between SPSS and unit-concatenation approaches.…”
Section: Introduction
confidence: 99%