2013
DOI: 10.1109/jproc.2013.2251852

Speech Synthesis Based on Hidden Markov Models

Abstract: This paper gives a general overview of hidden Markov model (HMM)-based speech synthesis, which has recently been demonstrated to be very effective in synthesizing speech. The main advantage of this approach is its flexibility in changing speaker identities, emotions, and speaking styles. This paper also discusses the relation between the HMM-based approach and the more conventional unit-selection approach that has dominated over the last decades. Finally, advanced techniques for future developments are described.


Cited by 375 publications (242 citation statements)
References 86 publications
“…After applying a weighting matrix W [3] to an input speech parameter sequence x = [x_1, …, x_T] for calculating its static-dynamic speech feature sequence, the DNNs predict a static-dynamic speech feature sequence of the converted speech. ŷ is generated from the static-dynamic features by using the maximum likelihood-based parameter generation algorithm [2]. We define the above speech parameter conversion as ŷ = G(x).…”
Section: Conventional DNN-based VC (mentioning)
confidence: 99%
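To make the generation step in this excerpt concrete, the sketch below builds a simplified weighting matrix W (static plus a first-order delta window only) and solves the maximum-likelihood parameter generation problem for a single feature dimension. The function names (build_weight_matrix, mlpg), the delta window, and all sizes are illustrative assumptions rather than details of the cited system.

```python
import numpy as np

def build_weight_matrix(T):
    """Stack static and first-order delta windows so that W @ c -> [c; delta(c)].
    (Hypothetical simplification: real systems typically add delta-delta windows too.)"""
    I = np.eye(T)
    D = np.zeros((T, T))
    for t in range(T):
        if t > 0:
            D[t, t - 1] = -0.5
        if t < T - 1:
            D[t, t + 1] = 0.5
    return np.vstack([I, D])                     # shape (2T, T)

def mlpg(mean, var, W):
    """Maximum-likelihood parameter generation for one feature dimension:
    solve (W^T S^-1 W) c = W^T S^-1 mean for the static trajectory c,
    where S is the diagonal covariance of the predicted static+dynamic features."""
    P = np.diag(1.0 / var)                       # S^-1
    A = W.T @ P @ W
    b = W.T @ P @ mean
    return np.linalg.solve(A, b)                 # shape (T,)

# toy usage: T = 5 frames, a single cepstral dimension
T = 5
W = build_weight_matrix(T)
x = np.random.randn(T)                           # source speaker's static features
x_sd = W @ x                                     # static+dynamic input a trained DNN would consume
mean = np.random.randn(2 * T)                    # stand-in for the DNN's predicted means
var = np.ones(2 * T)                             # stand-in for the predicted variances
y_hat = mlpg(mean, var, W)                       # converted static trajectory, i.e. ŷ
```

In this toy setup the same W plays both roles the excerpt mentions: expanding the input into its static-dynamic features and constraining the generated static trajectory ŷ.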
“…Deep Neural Networks (DNNs) [1] have been used as acoustic models for VC because they can represent the relationship between the input and output speech parameters more accurately than conventional Gaussian mixture models [2]. These acoustic models are trained with criteria such as the maximum likelihood criterion [3] and the Minimum Generation Error (MGE) criterion [4], [5]. However, the converted speech parameters tend to be oversmoothed, and this phenomenon degrades the quality of the converted speech.…”
Section: Introduction (mentioning)
confidence: 99%
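For concreteness, a minimal frame-wise DNN of the kind this excerpt refers to might look like the PyTorch sketch below, trained with a simple per-frame squared-error loss (the form a maximum-likelihood criterion takes under a Gaussian assumption with fixed variance). Network sizes, feature dimensions, and data are placeholders; an MGE-style variant would instead compute the loss on the MLPG-generated static trajectory.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# hypothetical dimensions: 40-dim source features in, 80-dim (static + delta) target features out
SRC_DIM, TGT_DIM = 40, 80

class ConversionDNN(nn.Module):
    """Frame-wise feedforward mapping from source to target static+dynamic features."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(SRC_DIM, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, TGT_DIM),
        )

    def forward(self, x):
        return self.net(x)

model = ConversionDNN()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# toy time-aligned parallel pair (T frames); real training uses many aligned utterances
T = 100
src = torch.randn(T, SRC_DIM)
tgt = torch.randn(T, TGT_DIM)

for _ in range(10):
    opt.zero_grad()
    loss = F.mse_loss(model(src), tgt)   # per-frame squared error on the target features
    loss.backward()
    opt.step()
```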
“…The front-end of a TTS system infers the symbolic representation of both segmental and prosodic properties of speech. Then, the back-end acoustic model converts the symbolic intermediate representation into a speech waveform, typically using the unit-selection method [8] or the statistical parametric method [9].…”
Section: Conventional Front-end Processing Flow For Typical English T… (mentioning)
confidence: 99%
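The division of labour described in this excerpt can be summarised with the interface sketch below; every function name, label format, and duration here is a hypothetical stand-in, intended only to show where the front-end ends and the back-end begins.

```python
from typing import List
import numpy as np

def front_end(text: str) -> List[str]:
    """Text analysis: map raw text to symbolic context labels (phones, stress, phrase breaks).
    A real front-end would run tokenisation, G2P, and prosody prediction; this stub
    just emits one placeholder label per word."""
    return [f"{word}/pos:unk/break:0" for word in text.lower().split()]

def back_end(labels: List[str], sample_rate: int = 16000) -> np.ndarray:
    """Acoustic model + vocoder: map the symbolic representation to a waveform.
    Stand-in for either a unit-selection or a statistical parametric back-end;
    here it only returns silence of a plausible duration (0.3 s per label)."""
    return np.zeros(int(0.3 * sample_rate * len(labels)), dtype=np.float32)

waveform = back_end(front_end("Hello world"))
```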
“…The acoustic model based on the hidden Markov model (HMM) has dominated the parametric speech synthesis method for decades [9]. However, its decision-tree-based model clustering method may not be able to express complex dependency in the input linguistic representation of text [23].…”
Section: Acoustic Modeling Based On the Deep Neural Network (mentioning)
confidence: 99%
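To illustrate the contrast drawn in this excerpt, the sketch below (synthetic data, hypothetical dimensions) uses a single regression tree as a loose stand-in for decision-tree clustering of context-dependent parameters, alongside an MLP that maps the full context-question vector to acoustic features in one shared model. It is an analogy to the modelling difference, not the HMM training procedure itself.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.neural_network import MLPRegressor

# synthetic stand-in data: 50 binary "context question" answers per frame -> 25-dim acoustic vector
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(2000, 50)).astype(float)
Y = rng.standard_normal((2000, 25))

# decision-tree view: each leaf acts like one tied (clustered) set of distribution parameters,
# and a frame can only use the single leaf its context answers route it to
tree = DecisionTreeRegressor(max_leaf_nodes=64, random_state=0).fit(X, Y)

# DNN view: one network sees the whole context vector at once, so it can model
# interactions between questions that a sequence of hard splits fragments
dnn = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=200, random_state=0).fit(X, Y)

leaf_pred = tree.predict(X[:1])   # piecewise-constant prediction from the matched leaf
dnn_pred = dnn.predict(X[:1])     # prediction from the shared network
```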
“…There have been many attempts at developing cross-lingual speech synthesis based on statistical voice conversion [9] or Hidden Markov Model (HMM)-based speech synthesis [10]. For example, one-to-many Gaussian Mixture Model (GMM)-based voice conversion can be applied to unsupervised speaker adaptation in cross-lingual speech synthesis [11], [12].…”
mentioning
confidence: 99%
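A minimal joint-density GMM mapping of the kind such conversion systems build on can be sketched as follows; the component count, feature dimensions, and helper names (fit_joint_gmm, convert) are assumptions, and a one-to-many or cross-lingual setup would further adapt or share parameters across speakers and languages rather than train on a single parallel pair.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(X, Y, n_components=4):
    """Fit a GMM on joint source/target frames [x; y]; assumes time-aligned parallel data."""
    return GaussianMixture(n_components=n_components, covariance_type="full",
                           random_state=0).fit(np.hstack([X, Y]))

def convert(gmm, x, dx):
    """Minimum mean-square-error mapping E[y | x] under the joint GMM."""
    weights, means, covs = gmm.weights_, gmm.means_, gmm.covariances_
    # responsibility of each component, evaluated on the source part of the joint space only
    resp = np.array([w * multivariate_normal.pdf(x, m[:dx], C[:dx, :dx])
                     for w, m, C in zip(weights, means, covs)])
    resp /= resp.sum()
    y = np.zeros(means.shape[1] - dx)
    for r, m, C in zip(resp, means, covs):
        # conditional mean of y given x for this component:
        # mu_y + Sigma_yx Sigma_xx^{-1} (x - mu_x)
        y += r * (m[dx:] + C[dx:, :dx] @ np.linalg.solve(C[:dx, :dx], x - m[:dx]))
    return y

# toy parallel data: 200 aligned frames of 10-dim source and target features
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
Y = 0.5 * X @ rng.standard_normal((10, 10)) + 0.1 * rng.standard_normal((200, 10))
gmm = fit_joint_gmm(X, Y)
y_converted = convert(gmm, X[0], dx=10)          # converted frame for the first source frame
```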