Prosodic Features Control by Symbols as Input of Sequence-to-Sequence Acoustic Modeling for Neural TTS

Kurihara, Kiyoshi; Seiyama, Nobumasa; Kumano, Tadashi

doi:10.1587/transinf.2020edp7104

Cited by 21 publications

(8 citation statements)

References 17 publications

“…The accent nucleus position is the mora just before the pitch descends in the accent phrase. These features are not explicitly written in Japanese raw text; however, they are important for prosodic naturalness in Japanese TTS systems [3,4]. AP consists of two parts: APBP and ANPP.…”

Section: Accent Prediction (AP) · mentioning

confidence: 99%
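
The definition of the accent nucleus quoted above can be illustrated with a small sketch. It assumes each mora of one accent phrase is labeled with a high ("H") or low ("L") pitch level; the function name, the labeling scheme, and the convention of returning 0 when no fall occurs are illustrative assumptions, not taken from the cited papers.

```python
def accent_nucleus_position(pitch_levels):
    """Return the 1-based index of the mora just before the pitch descends.

    pitch_levels: one "H" or "L" label per mora of a single accent phrase
    (a hypothetical input format chosen for this sketch).
    Returns 0 when no H-to-L fall occurs inside the phrase.
    """
    for i in range(len(pitch_levels) - 1):
        # A fall is a high mora immediately followed by a low mora.
        if pitch_levels[i] == "H" and pitch_levels[i + 1] == "L":
            return i + 1
    return 0


# Usage: a phrase whose pitch drops after the second mora.
print(accent_nucleus_position(["L", "H", "L", "L"]))
```

A fall that is only realized on a following particle is not visible in this input format, so such phrases would be reported as 0 here.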

“…We stopped training when the learning rate fell below 10⁻⁴. As implicit features, the BERT-base model and the Flair model which are pre-trained on Japanese Wikipedia were used³,⁴. When BERT was used as an implicit feature, the last four layers were concatenated⁵.…”

Section: Implicit Features · mentioning

confidence: 99%
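
Concatenating the last four layers, as the statement above describes, can be sketched in plain Python. The input layout (a list of layers, each holding one vector per token), the function name, and the layer ordering in the output are assumptions made for illustration; the citing paper does not give its implementation here.

```python
def concat_last_four_layers(hidden_states):
    """Concatenate, per token, the vectors of the last four layers.

    hidden_states: list of layers; each layer is a list of per-token
    vectors (lists of floats). This layout is assumed for the sketch.
    Returns one vector per token, ordered from the fourth-to-last layer
    to the last layer.
    """
    if len(hidden_states) < 4:
        raise ValueError("need at least four layers")
    last_four = hidden_states[-4:]
    num_tokens = len(last_four[0])
    return [
        [value for layer in last_four for value in layer[t]]
        for t in range(num_tokens)
    ]


# Usage: 5 toy layers, 2 tokens, 3 dimensions; each value is the layer index.
toy = [[[float(layer)] * 3 for _ in range(2)] for layer in range(5)]
features = concat_last_four_layers(toy)
print(len(features), len(features[0]))
```

With a hidden size of 768, as in BERT-base, the same operation would yield 3072-dimensional token features.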

“…One of the difficulties in TTS systems is the high language dependency of the front-end. Recent studies have shown that end-to-end TTS systems, especially Japanese ones, require not only phonetic but also prosodic information in order to achieve high quality speech [3,4]. To enable the TTS front-end to precisely estimate both phonetic and prosodic information, two problems must be solved: polyphone disambiguation (PD) and accent prediction (AP).…”

Section: Introduction · mentioning

confidence: 99%

Polyphone disambiguation and accent prediction using pre-trained language models in Japanese TTS front-end

Hida, Hamada, Kamada et al. 2022

Preprint

Although end-to-end text-to-speech (TTS) models can generate natural speech, challenges still remain when it comes to estimating sentence-level phonetic and prosodic information from raw text in Japanese TTS systems. In this paper, we propose a method for polyphone disambiguation (PD) and accent prediction (AP). The proposed method incorporates explicit features extracted from morphological analysis and implicit features extracted from pre-trained language models (PLMs). We use BERT and Flair embeddings as implicit features and examine how to combine them with explicit features. Our objective evaluation results showed that the proposed method improved the accuracy by 5.7 points in PD and 6.0 points in AP. Moreover, the perceptual listening test results confirmed that a TTS system employing our proposed model as a front-end achieved a mean opinion score close to that of synthesized speech with ground-truth pronunciation and accent in terms of naturalness.

“…We followed the recipe in egs2/jsut/tts1, using 7,196 utterances for training, 250 for validation, and 250 for evaluation. We used the G2P function based on Open JTalk enhanced with prosody symbols [46] for all models. We compared the following architectures: Tacotron 2 Tacotron 2 + HiFi-GAN.…”

Section: Japanese Single Speaker · mentioning

confidence: 99%

ESPnet2-TTS: Extending the Edge of TTS Research

Hayashi, Yamamoto, Yoshimura et al. 2021

Preprint

This paper describes ESPnet2-TTS, an end-to-end text-to-speech (E2E-TTS) toolkit. ESPnet2-TTS extends our earlier version, ESPnet-TTS, by adding many new features, including: on-the-fly flexible pre-processing, joint training with neural vocoders, and state-of-the-art TTS models with extensions like full-band E2E text-to-waveform modeling, which simplify the training pipeline and further enhance TTS performance. The unified design of our recipes enables users to quickly reproduce state-of-the-art E2E-TTS results. We also provide many pre-trained models in a unified Python interface for inference, offering a quick means for users to generate baseline samples and build demos. Experimental evaluations with English and Japanese corpora demonstrate that our provided models synthesize utterances comparable to ground-truth ones, achieving state-of-the-art TTS performance. The toolkit is available online at https://github.com/espnet/espnet.

“…In [1], data augmentation is applied to extend the voice range in terms of F0 and duration, and note embeddings are used in parallel to the phoneme sequence, to pursue singing synthesis. [17] inserts prosodic symbols to the phoneme sequence to model accents, pauses, and sentence endings.…”

Section: Related Work · mentioning

confidence: 99%
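
The idea of inserting prosodic symbols into the phoneme sequence, mentioned in the statement above, can be sketched as follows. The specific symbols ("]" for a pitch fall, "#" for an accent phrase boundary, "_" for a pause, "$" and "?" for sentence endings), the input dictionary layout, and the function name are illustrative assumptions, not a confirmed reproduction of the symbol set used in the cited paper.

```python
# Hypothetical symbol inventory chosen for this sketch.
FALL, BOUNDARY, PAUSE, END, QUESTION = "]", "#", "_", "$", "?"


def add_prosodic_symbols(accent_phrases, is_question=False):
    """Interleave prosodic symbols with phonemes.

    accent_phrases: list of dicts with keys
      "moras": list of moras, each a list of phoneme strings,
      "nucleus": 1-based accent nucleus mora (0 if none),
      "pause_after": optional bool, True if a pause follows the phrase.
    """
    sequence = []
    last = len(accent_phrases) - 1
    for index, phrase in enumerate(accent_phrases):
        for mora_number, mora in enumerate(phrase["moras"], start=1):
            sequence.extend(mora)
            if mora_number == phrase["nucleus"]:
                sequence.append(FALL)  # pitch falls after the nucleus mora
        if index < last:
            # Between phrases: a pause if marked, otherwise a plain boundary.
            sequence.append(PAUSE if phrase.get("pause_after") else BOUNDARY)
    sequence.append(QUESTION if is_question else END)
    return sequence


# Usage: two accent phrases separated by a pause.
phrases = [
    {"moras": [["a"], ["m", "e"]], "nucleus": 1, "pause_after": True},
    {"moras": [["g", "a"]], "nucleus": 0},
]
print(" ".join(add_prosodic_symbols(phrases)))
```

Keeping the symbols as ordinary tokens in the input sequence means the acoustic model needs no architectural change to consume them.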

Improved Prosodic Clustering for Multispeaker and Speaker-independent Phoneme-level Prosody Control

Christidou, Vioni, Ellinas et al. 2021

Preprint

This paper presents a method for phoneme-level prosody control of F0 and duration on a multispeaker text-to-speech setup, which is based on prosodic clustering. An autoregressive attention-based model is used, incorporating multispeaker architecture modules in parallel to a prosody encoder. Several improvements over the basic single-speaker method are proposed that increase the prosodic control range and coverage. More specifically we employ data augmentation, F0 normalization, balanced clustering for duration, and speaker-independent prosodic clustering. These modifications enable fine-grained phoneme-level prosody control for all speakers contained in the training set, while maintaining the speaker identity. The model is also fine-tuned to unseen speakers with limited amounts of data and it is shown to maintain its prosody control capabilities, verifying that the speaker-independent prosodic clustering is effective. Experimental results verify that the model maintains high output speech quality and that the proposed method allows efficient prosody control within each speaker's range despite the variability that a multispeaker setting introduces.
