Statistical Parametric Speech Synthesis Based on Speaker and Language Factorization

Zen, Heiga; Braunschweiler, Norbert; Buchholz, Sabine; Gales, Mark J. F.; Knill, Kate; Krstulović, Sacha; Latorre, Javier

doi:10.1109/tasl.2012.2187195

Cited by 82 publications

(59 citation statements)

References 25 publications


“…Cluster Adaptive Training (CAT) was originally developed for speech recognition to enable rapid speaker adaptation [8]. And in [9] CAT has been extended for statistical parametric synthesis to perform the speaker and language factorization. The CAT model consists of cluster of models and transformation is employed to represent the specific target model.…”

Section: Alternative Statistical Parametric Models (mentioning)

confidence: 99%
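As a rough illustration of the idea in the statement above (not code from either paper): in CAT, a speaker-specific mean vector is commonly formed as a weighted sum of cluster mean vectors, with the weights estimated per speaker. The function name `cat_mean`, the toy numbers, and the convention of fixing the first (bias) cluster weight to 1 are assumptions made for this sketch.

```python
def cat_mean(cluster_means, weights):
    """Return the weighted sum of cluster mean vectors.

    cluster_means: list of P mean vectors (one per cluster), each of length D.
    weights: list of P interpolation weights for one speaker (hypothetical values).
    """
    if len(cluster_means) != len(weights):
        raise ValueError("need exactly one weight per cluster")
    dim = len(cluster_means[0])
    return [
        sum(w * mean[d] for w, mean in zip(weights, cluster_means))
        for d in range(dim)
    ]


# Toy example: three clusters, two-dimensional means.
cluster_means = [[1.0, 2.0], [0.5, -0.5], [-1.0, 1.0]]
# Assumption: the first cluster acts as a bias cluster with its weight fixed to 1.
speaker_weights = [1.0, 0.2, -0.3]
adapted_mean = cat_mean(cluster_means, speaker_weights)
```

Only the short weight vector changes from speaker to speaker in this sketch; the cluster means are shared, which is what makes rapid adaptation from little data plausible.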

Factorized context modelling for Text-to-Speech synthesis

King

2013

2013 IEEE International Conference on Acoustics, Speech and Signal Processing


Because speech units are so context-dependent, a large number of linguistic context features are generally used by HMM-based Text-to-Speech (TTS) speech synthesis systems, via context-dependent models. Since it is impossible to train separate models for every context, decision trees are used to discover the most important combinations of features that should be modelled. The task of the decision tree is very hard (to generalize from a very small observed part of the context feature space to the rest), and decision trees have a major weakness: they cannot directly take advantage of factorial properties, because they subdivide the model space based on one feature at a time. We propose a Dynamic Bayesian Network (DBN) based Mixed Memory Markov Model (MMMM) to provide factorization of the context space. The results of a listening test are provided as evidence that the model successfully learns the factorial nature of this space.


“…In contrast to concatenative synthesis [15], which stores speech waveforms, the parametric representation in SPSS has several potential advantages, including flexibility in changing voice characteristics [3], speaker and style adaptation [16][17][18][19], easier multilingual support [20][21][22], superior coverage of acoustic space [3], reduced memory footprint [3], and better robustness to low-quality speech recordings [23].…”

Section: Introduction (mentioning)

confidence: 99%

Soft context clustering for F0 modeling in HMM-based speech synthesis

Khorram

Sameti

King

2015

EURASIP J. Adv. Signal Process.


This paper proposes the use of a new binary decision tree, which we call a soft decision tree, to improve generalization performance compared to the conventional 'hard' decision tree method that is used to cluster context-dependent model parameters in statistical parametric speech synthesis. We apply the method to improve the modeling of fundamental frequency, which is an important factor in synthesizing natural-sounding high-quality speech. Conventionally, hard decision tree-clustered hidden Markov models (HMMs) are used, in which each model parameter is assigned to a single leaf node. However, this 'divide-and-conquer' approach leads to data sparsity, with the consequence that it suffers from poor generalization, meaning that it is unable to accurately predict parameters for models of unseen contexts: the hard decision tree is a weak function approximator. To alleviate this, we propose the soft decision tree, which is a binary decision tree with soft decisions at the internal nodes. In this soft clustering method, internal nodes select both their children with certain membership degrees; therefore, each node can be viewed as a fuzzy set with a context-dependent membership function. The soft decision tree improves model generalization and provides a superior function approximator because it is able to assign each context to several overlapped leaves. In order to use such a soft decision tree to predict the parameters of the HMM output probability distribution, we derive the smoothest (maximum entropy) distribution which captures all partial first-order moments and a global second-order moment of the training samples. Employing such a soft decision tree architecture with maximum entropy distributions, a novel speech synthesis system is trained using maximum likelihood (ML) parameter re-estimation and synthesis is achieved via maximum output probability parameter generation. 
In addition, a soft decision tree construction algorithm optimizing a log-likelihood measure is developed. Both subjective and objective evaluations were conducted and indicate a considerable improvement over the conventional method.
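The soft-decision idea in this abstract can be sketched as follows. This is a minimal illustration, not the authors' system: the sigmoid gate, the `sharpness` parameter, the class names, and the plain weighted average at the leaves are all assumptions (the paper itself derives a maximum entropy distribution rather than averaging leaf values).

```python
import math


class Leaf:
    """Terminal node holding one parameter value (e.g. a mean log-F0)."""
    def __init__(self, value):
        self.value = value


class Node:
    """Internal node with a soft question on one context feature."""
    def __init__(self, feature, threshold, left, right, sharpness=1.0):
        self.feature = feature      # index into the context vector
        self.threshold = threshold  # centre of the soft split
        self.left = left
        self.right = right
        self.sharpness = sharpness  # larger values approach a hard split


def memberships(tree, context, degree=1.0):
    """Return (leaf, membership degree) pairs; degrees sum to 1."""
    if isinstance(tree, Leaf):
        return [(tree, degree)]
    # Assumed gate: sigmoid of the signed distance to the threshold.
    z = tree.sharpness * (context[tree.feature] - tree.threshold)
    go_right = 1.0 / (1.0 + math.exp(-z))
    return (memberships(tree.left, context, degree * (1.0 - go_right))
            + memberships(tree.right, context, degree * go_right))


def predict(tree, context):
    """Blend leaf values by their membership degrees."""
    return sum(d * leaf.value for leaf, d in memberships(tree, context))


# Toy tree: one soft question on feature 0, then one on feature 1.
tree = Node(0, 0.5, Leaf(100.0), Node(1, 2.0, Leaf(150.0), Leaf(200.0)))
degrees = [d for _, d in memberships(tree, [0.7, 1.0])]
```

Unlike a hard tree, every leaf receives a non-zero degree here, so a context near a split boundary draws on the parameters of several overlapping leaves.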


“…However, working with average voice models is difficult for under-resourced languages since building such general model needs remarkable efforts to design, record, and transcribe a thorough multi-speaker speech database [3]. To alleviate the data sparsity problem in under-resourced languages, speaker and language factorization (SLF) technique can be used [34]. SLF attempts to factorize speaker-specific and language-specific characteristics in training data and then model them using different transforms.…”

Section: Introduction (mentioning)

confidence: 99%

Context-dependent acoustic modeling based on hidden maximum entropy model for statistical parametric speech synthesis

Khorram

Sameti

Bahmaninezhad

et al.

2014

EURASIP J. Audio Speech Music Process.


Decision tree-clustered context-dependent hidden semi-Markov models (HSMMs) are typically used in statistical parametric speech synthesis to represent probability densities of acoustic features given contextual factors. This paper addresses three major limitations of this decision tree-based structure: (i) The decision tree structure lacks adequate context generalization. (ii) It is unable to express complex context dependencies. (iii) Parameters generated from this structure represent sudden transitions between adjacent states. In order to alleviate the above limitations, many former papers applied multiple decision trees with an additive assumption over those trees. Similarly, the current study uses multiple decision trees as well, but instead of the additive assumption, it is proposed to train the smoothest distribution by maximizing entropy measure. Obviously, increasing the smoothness of the distribution improves the context generalization. The proposed model, named hidden maximum entropy model (HMEM), estimates a distribution that maximizes entropy subject to multiple moment-based constraints. Due to the simultaneous use of multiple decision trees and maximum entropy measure, the three aforementioned issues are considerably alleviated. Relying on HMEM, a novel speech synthesis system has been developed with maximum likelihood (ML) parameter re-estimation as well as maximum output probability parameter generation. Additionally, an effective and fast algorithm that builds multiple decision trees in parallel is devised. Two sets of experiments have been conducted to evaluate the performance of the proposed system. In the first set of experiments, HMEM with some heuristic context clusters is implemented. This system outperformed the decision tree structure in small training databases (i.e., 50, 100, and 200 sentences). 
In the second set of experiments, the HMEM performance with four parallel decision trees is investigated using both subjective and objective tests. All evaluation results of the second experiment confirm significant improvement of the proposed system over the conventional HSMM.
