Interspeech 2017
DOI: 10.21437/interspeech.2017-848

Reducing Mismatch in Training of DNN-Based Glottal Excitation Models in a Statistical Parametric Text-to-Speech System

Abstract: Neural network-based models that generate glottal excitation waveforms from acoustic features have been found to give improved quality in statistical parametric speech synthesis. Until now, however, these models have been trained separate…

Cited by 6 publications (6 citation statements) · References 19 publications
“…Alignments between the linguistic and acoustic features are found using the HMM-based speech synthesis system (HTS) [28], and we use a bidirectional long short-term memory (BLSTM) recurrent neural network (RNN) for the acoustic model. For system details, see [29]. This paper uses a common acoustic feature set of glottal vocoder features [22] for all neural vocoders: 30 vocal tract filter line spectral frequencies (LSFs), 10 glottal source spectral envelope LSFs, 5 harmonic-to-noise ratio (HNR) parameters, a mel-scale fundamental frequency value (interpolated over unvoiced frames), and a binary voicing flag.…”
Section: Speech Synthesis System
confidence: 99%
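The feature set quoted above sums to a 47-dimensional acoustic vector per frame (30 + 10 + 5 + 1 + 1). As a minimal sketch of that layout — the field names, dictionary structure, and `split_frame` helper are illustrative assumptions, not part of the cited systems:

```python
import numpy as np

# Illustrative layout of the glottal vocoder feature set quoted above:
# 30 vocal tract LSFs + 10 glottal source LSFs + 5 HNR parameters
# + 1 mel-scale F0 + 1 binary voicing flag = 47 dims per frame.
FEATURE_LAYOUT = {
    "vocal_tract_lsf": 30,
    "glottal_source_lsf": 10,
    "hnr": 5,
    "f0_mel": 1,
    "voicing_flag": 1,
}

def split_frame(frame: np.ndarray) -> dict:
    """Split one 47-dim acoustic feature frame into named sub-vectors."""
    total = sum(FEATURE_LAYout.values()) if False else sum(FEATURE_LAYOUT.values())
    assert frame.shape == (total,), f"expected ({total},), got {frame.shape}"
    out, start = {}, 0
    for name, dim in FEATURE_LAYOUT.items():
        out[name] = frame[start:start + dim]
        start += dim
    return out

frame = np.zeros(47)
parts = split_frame(frame)
# parts["vocal_tract_lsf"] has shape (30,), parts["hnr"] has shape (5,)
```

Such a frame-wise vector would serve both as the acoustic-model target and as the conditioning input to the neural vocoders compared in the citing papers.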
“…Alignments between the linguistic and acoustic features are found using the HMM-based speech synthesis system (HTS) [28], and we use a bidirectional long short-term memory (BLSTM) recurrent neural network (RNN) for the acoustic model. For system details, see [29].…”
Section: Speech Synthesis System
confidence: 99%
“…The idea of using an MbG structure is not new. In a study of parametric glottal vocoders, Juvela et al. [12] first proposed the closed-loop extraction of glottal excitation from the generated spectral parameters, and our own previous work proposed the MbG structure to compensate for missing noise components in generated glottal signals [13]. However, it was not possible to fully utilize the effectiveness of the MbG training strategy, because our experiments were only performed with simple deep learning models, including stacked feed-forward and/or long short-term memory (LSTM) networks.…”
Section: Related Work
confidence: 99%
“…In the current experiments, GlottDNN uses the glottal excitation model configuration described in [47]. To simplify comparisons, the acoustic features used for the TTS acoustic-model targets and the neural vocoder inputs are the GlottDNN acoustic feature set (for both the GlotNet and WaveNet neural vocoders).…”
Section: B. Glottal Vocoders
confidence: 99%