1998
DOI: 10.1109/89.668817

An RNN-based prosodic information synthesizer for Mandarin text-to-speech

Abstract: A new RNN-based prosodic information synthesizer for Mandarin Chinese text-to-speech (TTS) is proposed in this paper. Its four-layer recurrent neural network (RNN) generates prosodic information such as syllable pitch contours, syllable energy levels, syllable initial and final durations, as well as intersyllable pause durations. The input layer and first hidden layer operate with a word-synchronized clock to represent current-word phonologic states within the prosodic structure of text to be synthesized. The s…

Cited by 130 publications (17 citation statements)
References 45 publications

“…Recurrent neural networks (RNNs) are a specific neural topology with feedback connections that allow modeling a memory component, which tracks activations in time in addition to the classic feed-forward path from input to output. They have thus been used as deep architectures that effectively predict either prosodic features [23,24], or duration and acoustic features [7, 25–27]. Some of these works also investigate possible performance differences using different RNN cell types, such as long short-term memory (LSTM) or gated recurrent unit modules [28].…”
Section: Introduction (mentioning)
confidence: 99%
“…Discrete orthogonal polynomials are widely used to represent syllabic pitch contours of Mandarin [1–6] and Chinese dialects [7]. In Chen and Wang's study of vector quantization of pitch information for Mandarin [1], the pitch contour of each syllable is parameterized by a 3-rd order discrete orthogonal polynomial expansion expressed by…”
Section: Introduction (mentioning)
confidence: 99%
“…The phone duration modeling approaches are divided in two major categories: The rule-based (Klatt, 1979) and the data-driven methods (Mobius and Santen, 1996; Santen, 1992; Chen et al., 1998; Chien and Huang, 2003; Lazaridis et al., 2007). In the rule-based methods manually produced rules, extracted from experimental studies on large sets of utterances or based on previous knowledge, are utilized for determining the duration of segments.…”
Section: Introduction (mentioning)
confidence: 99%
“…Over the last years various statistical methods have been applied in the phone duration modeling task such as, Linear Regression (LR) (Takeda et al, 1989), decisions tree-based models (Mobius and Santen, 1996), Sums-Of-Products (SOP) (Santen, 1992). Artificial Neural Networks (ANN) techniques (Chen et al, 1998), Bayesian models (Chien and Huang, 2003) and instance-based algorithms (Lazaridis et al, 2007) have also been introduced on the phone duration modeling task. Consequently the data-driven approaches offer us the ability to overcome the time consuming labor of the manual extraction of the rules which are needed in the rule-based approaches.…”
Section: Introduction (mentioning)
confidence: 99%