2012
DOI: 10.1109/tasl.2011.2169787

The Deterministic Plus Stochastic Model of the Residual Signal and Its Applications

Abstract: Speech generated by parametric synthesizers generally suffers from a typical buzziness, similar to what was encountered in old LPC-like vocoders. In order to alleviate this problem, a more suited modeling of the excitation should be adopted. For this, we hereby propose an adaptation of the Deterministic plus Stochastic Model (DSM) for the residual. In this model, the excitation is divided into two distinct spectral bands delimited by the maximum voiced frequency. The deterministic part concerns the low-frequen…

Cited by 93 publications (115 citation statements)
References: 36 publications
“…Experiments in [26] have shown that a mean glottal flow pulse (similar to eigenresidual in [17]) was rated better in quality than excitation using selection of natural pulses and equal to a pulse reconstructed from 12 PCA components. The latter comparison was also informally done using residual waveform in [17] with the same conclusion that using more components for modeling does not improve quality. In creaky voice synthesis [27], the type of deterministic waveform has also been shown to have relevant perceptual effect.…”
Section: Periodic Waveform (mentioning)
confidence: 99%
“…In [15], a hybrid approach makes use of a codebook of pitch-synchronous residual frames which are selected at synthesis time according to the down-sampled version of the excitation. In [16,17], the deterministic plus stochastic model (DSM) of the residual signal is proposed. DSM excitation consists of two components: the deterministic waveform called eigenresidual, which is obtained by principal component analysis (PCA) on a set of pitch-synchronous residual frames, and an aperiodic excitation delimited by maximum voiced frequency Fm and modulated in time according to a speaker-specific time envelope.…”
Section: Introduction (mentioning)
confidence: 99%
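The two DSM components described in the statement above can be illustrated with a minimal sketch. It is not the cited authors' implementation. The function names `eigenresidual` and `dsm_frame` are hypothetical. The code also assumes that the frames are already pitch-synchronous and normalized to a common length, that PCA is computed on mean-centered data, that the band split at Fm uses brick-wall FFT masks, and that the time envelope is supplied by the caller.

```python
import numpy as np

def eigenresidual(frames):
    """Return the first principal component of a set of residual frames.

    frames: array of shape (num_frames, frame_length); assumed here to be
    pitch-synchronous and already resampled to a common length.
    """
    X = np.asarray(frames, dtype=float)
    Xc = X - X.mean(axis=0)  # centering is an assumption of this sketch
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return vt[0]

def dsm_frame(eigres, fm_hz, fs, envelope, rng):
    """Build one excitation frame: deterministic part below Fm plus
    time-modulated noise above Fm (illustrative brick-wall band split)."""
    n = len(eigres)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    low = freqs <= fm_hz
    # Deterministic part: keep only the eigenresidual content below Fm.
    det = np.fft.irfft(np.fft.rfft(eigres) * low, n=n)
    # Stochastic part: white noise restricted to the band above Fm,
    # then shaped in time by the supplied envelope.
    noise = rng.standard_normal(n)
    sto = np.fft.irfft(np.fft.rfft(noise) * ~low, n=n) * envelope
    return det + sto

# Toy usage with synthetic frames (placeholder data, not real residuals).
rng = np.random.default_rng(0)
n, fs = 64, 16000
pulse = np.hanning(n) * np.sin(2 * np.pi * np.arange(n) / n)
gains = rng.uniform(0.5, 1.5, size=200)
frames = gains[:, None] * pulse + 0.01 * rng.standard_normal((200, n))
eigres = eigenresidual(frames)
frame = dsm_frame(eigres, fm_hz=4000.0, fs=fs, envelope=np.hanning(n), rng=rng)
```

The Hann window stands in for the speaker-specific time envelope, and the 4 kHz value for Fm is an arbitrary placeholder; neither value is taken from the cited work.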
“…For the filter, we extracted the traditional Mel Generalized Cepstral (MGC) coefficients (with α = 0.42, γ = 0 and order of MGC analysis = 24). For the excitation, we used the Deterministic plus Stochastic Model (DSM) of the residual signal proposed in [11], since it was shown to significantly improve the naturalness of the delivered speech. More precisely, both deterministic and stochastic components of DSM were estimated on the training dataset for each degree of articulation.…”
Section: Conception Of the Speech Synthesizers (mentioning)
confidence: 99%
“…However, these advantages are achieved at the expense of one major disadvantage, i.e., degradation in the quality of synthetic speech [1]. This shortcoming results from three important factors: vocoding distortion [10][11][12][13], accuracy of statistical models [14][15][16][17][18][19][20][21][22][23][24][25], and accuracy of parameter generation algorithms [26][27][28]. This paper is an attempt to alleviate the second factor and improve the accuracy of statistical models.…”
Section: Introduction (mentioning)
confidence: 99%
“…In the training phase, first acoustic and contextual factors are extracted for the whole training database using a vocoder [12,29,30] and a natural language pre-processor. Next, the relationship between acoustic and contextual factors is modeled using a context-dependent statistical approach [14][15][16][17][18][19][20][21][22][23][24][25].…”
Section: Introduction (mentioning)
confidence: 99%