An RNN-based prosodic information synthesizer for Mandarin text-to-speech

Chen, Sin‐Horng; Hwang, Shaw‐Hwa; Wang, Yih-Ru

doi:10.1109/89.668817

Cited by 130 publications

(17 citation statements)

References 45 publications


“…Recurrent neural networks (RNNs) are a specific neural topology with feedback connections that allow modeling a memory component, which tracks activations in time in addition to the classic feed-forward path from input to output. They have thus been used as deep architectures that effectively predict either prosodic features [23, 24], or duration and acoustic features [7, 25–27]. Some of these works also investigate possible performance differences using different RNN cell types, such as long short-term memory (LSTM) or gated recurrent unit modules [28].…”

Section: Introduction (mentioning)

confidence: 99%

Exploring Efficient Neural Architectures for Linguistic–Acoustic Mapping in Text-To-Speech

2019


Conversion from text to speech relies on the accurate mapping from linguistic to acoustic symbol sequences, for which current practice employs recurrent statistical models such as recurrent neural networks. Despite the good performance of such models (in terms of low distortion in the generated speech), their recursive structure with intermediate affine transformations tends to make them slow to train and to sample from. In this work, we explore two different mechanisms that enhance the operational efficiency of recurrent neural networks, and study their performance–speed trade-off. The first mechanism is based on the quasi-recurrent neural network, where expensive affine transformations are removed from temporal connections and placed only on feed-forward computational directions. The second mechanism includes a module based on the transformer decoder network, designed without recurrent connections but emulating them with attention and positioning codes. Our results show that the proposed decoder networks are competitive in terms of distortion when compared to a recurrent baseline, whilst being significantly faster in terms of CPU and GPU inference time. The best performing model is the one based on the quasi-recurrent mechanism, reaching the same level of naturalness as the recurrent neural network based model with a speedup of 11.2 on CPU and 3.3 on GPU.
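The quasi-recurrent idea summarized in this abstract can be illustrated with a small NumPy sketch. It follows the commonly described QRNN "fo-pooling" formulation and is not taken from the citing paper itself: the function name `qrnn_layer`, the weight names, and the filter width are assumptions made for illustration. The affine transforms run over the whole input sequence at once (feed-forward direction), while the step-to-step recurrence uses only element-wise operations.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def qrnn_layer(x, W_z, W_f, W_o):
    """Hypothetical QRNN-style layer with fo-pooling.

    x:   (T, d_in) input sequence
    W_*: (k, d_in, d_hid) causal convolution filters of width k
    Returns the hidden sequence h with shape (T, d_hid).
    """
    k, d_in, d_hid = W_z.shape
    T = x.shape[0]
    # Left-pad so that step t only sees inputs t-k+1 .. t (causal).
    x_pad = np.vstack([np.zeros((k - 1, d_in)), x])
    # Stack the k most recent inputs for every step: (T, k, d_in).
    windows = np.stack([x_pad[t:t + k] for t in range(T)])
    # Affine transforms, computed for all time steps in one shot.
    z = np.tanh(np.einsum("tki,kih->th", windows, W_z))   # candidate
    f = sigmoid(np.einsum("tki,kih->th", windows, W_f))   # forget gate
    o = sigmoid(np.einsum("tki,kih->th", windows, W_o))   # output gate
    # Temporal connection: element-wise only, no matrix multiply.
    c = np.zeros(d_hid)
    h = np.empty((T, d_hid))
    for t in range(T):
        c = f[t] * c + (1.0 - f[t]) * z[t]
        h[t] = o[t] * c
    return h
```

Because the loop over time contains no matrix products, most of the work is in the three batched transforms, which is the property the abstract credits for the faster inference.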


“…Discrete orthogonal polynomials are widely used to represent syllabic pitch contours of Mandarin [1][2][3][4][5][6] and Chinese dialects [7]. In Chen and Wang's study of vector quantization of pitch information for Mandarin [1], the pitch contour of each syllable is parameterized by a 3rd-order discrete orthogonal polynomial expansion expressed by…”

Section: Introduction (mentioning)

confidence: 99%

On Smoothing and Enhancing Dynamics of Pitch Contours Represented by Discrete Orthogonal Polynomials for Prosody Generation

Chiang

2016

Interspeech 2016


This paper presents a new pitch contour generation algorithm for statistical syllable-based logF0 generation models which represent logF0 contours of syllables by coefficients of discrete orthogonal polynomials, i.e. orthogonal expansion coefficients (OECs). The conventional statistical logF0 models can generate smooth pitch contour within a syllable because of the continuity property of polynomials. However, the models do not ensure to produce continuous and smooth logF0 contours in the proximity of syllable junctures. Besides, dynamic range of the generated logF0 contours is generally smaller than the one of real speech. The above two shortcomings would result in unnatural and monotonous prosody. To overcome these shortcomings, juncture-smooth and dynamics-enhancing OEC generation algorithms are hence proposed in this paper. Analysis on the generated logF0 contours by the proposed algorithm shows some improvements in logF0 smoothness at syllable junctures and enhanced logF0 dynamic range. In addition, a perceptual evaluation of the logF0 contour generated by the proposed algorithm shows an improvement in naturalness of the synthesized speech.
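The representation discussed in this entry, a syllable's logF0 contour summarized by a handful of orthogonal expansion coefficients, can be sketched as follows. This is an illustration under stated assumptions, not the basis or normalization used in the cited works: the orthonormal basis here is obtained by QR-factorizing a Vandermonde matrix on normalized time in [0, 1], and the function names `orthonormal_poly_basis`, `encode`, and `decode` are hypothetical.

```python
import numpy as np


def orthonormal_poly_basis(n_frames, order=3):
    """Discrete orthonormal polynomials of degree 0..order on n_frames points.

    Assumption: built by QR factorization of [1, x, x^2, ...], which is
    equivalent to Gram-Schmidt; requires n_frames >= order + 1.
    """
    x = np.linspace(0.0, 1.0, n_frames)
    V = np.vander(x, order + 1, increasing=True)  # columns 1, x, x^2, ...
    Q, R = np.linalg.qr(V)
    # Fix column signs so the basis does not depend on the QR routine.
    signs = np.sign(np.diag(R))
    return Q * signs  # shape (n_frames, order + 1)


def encode(logf0, order=3):
    """Project a logF0 contour onto the basis: order + 1 coefficients."""
    Q = orthonormal_poly_basis(len(logf0), order)
    return Q.T @ np.asarray(logf0, dtype=float)


def decode(coeffs, n_frames):
    """Rebuild a contour with the same frame count used for encoding."""
    Q = orthonormal_poly_basis(n_frames, len(coeffs) - 1)
    return Q @ np.asarray(coeffs, dtype=float)
```

With a 3rd-order expansion, each syllable is described by four numbers regardless of its length in frames. Since each syllable is encoded separately, nothing in this representation ties the end of one contour to the start of the next, which is the juncture discontinuity the abstract sets out to address.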


“…The phone duration modeling approaches are divided in two major categories: The rule-based (Klatt, 1979) and the data-driven methods (Mobius and Santen, 1996; Santen, 1992; Chen et al, 1998; Chien and Huang, 2003; Lazaridis et al, 2007). In the rule-based methods manually produced rules, extracted from experimental studies on large sets of utterances or based on previous knowledge, are utilized for determining the duration of segments.…”

Section: Introduction (mentioning)

confidence: 99%

“…Over the last years various statistical methods have been applied in the phone duration modeling task such as, Linear Regression (LR) (Takeda et al, 1989), decisions tree-based models (Mobius and Santen, 1996), Sums-Of-Products (SOP) (Santen, 1992). Artificial Neural Networks (ANN) techniques (Chen et al, 1998), Bayesian models (Chien and Huang, 2003) and instance-based algorithms (Lazaridis et al, 2007) have also been introduced on the phone duration modeling task. Consequently the data-driven approaches offer us the ability to overcome the time consuming labor of the manual extraction of the rules which are needed in the rule-based approaches.…”

Section: Introduction (mentioning)

confidence: 99%

Comparative Evaluation of Phone Duration Models for Greek Emotional Speech

Lazaridis

2010

Journal of Computer Science


Problem statement: In this study we cope with the task of phone duration modeling for Greek emotional speech synthesis. Approach: Various well established machine learning techniques are applied for this purpose to an emotional speech database consisting of five archetypal emotions. The constructed phone duration prediction models are built on phonetic, morphosyntactic and prosodic features that can be extracted only from text. We employ model and regression trees, linear regression, lazy learning algorithms and meta-learning algorithms using regression trees as base classifiers, trained on a Modern Greek emotional database consisting of five emotional categories: anger, fear, joy, neutral and sadness. Results: Model trees based on the M5' algorithm and meta-learning algorithms using as base classifier regression trees based on the M5' algorithm proved to perform better. Conclusion: It was observed that the emotional categories of the speech database with the most uniform distribution of phone durations built the most accurate models.
