2021
DOI: 10.1587/transinf.2020edp7104
Prosodic Features Control by Symbols as Input of Sequence-to-Sequence Acoustic Modeling for Neural TTS

Abstract: This paper describes a method to control prosodic features using phonetic and prosodic symbols as input of attention-based sequence-to-sequence (seq2seq) acoustic modeling (AM) for neural text-to-speech (TTS). The method involves inserting a sequence of prosodic symbols between phonetic symbols that are then used to reproduce prosodic acoustic features, i.e. accents, pauses, accent breaks, and sentence endings, in several seq2seq AM methods. The proposed phonetic and prosodic labels have simple descriptions and…
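The abstract's core idea — interleaving prosodic symbols with the phoneme sequence before it enters the seq2seq encoder — can be illustrated with a minimal sketch. The symbol inventory and the `interleave` helper below are hypothetical, chosen only to show the shape of the input representation; they are not the paper's actual label set.

```python
# Illustrative sketch: insert prosodic symbols between phonetic symbols so a
# seq2seq acoustic model can learn accents, pauses, accent breaks, and
# sentence endings from a single flat input sequence.
# The symbol set below is an assumption, not the paper's label inventory.
PROSODIC = {
    "accent_rise": "[",   # pitch rises after this point
    "accent_fall": "]",   # accent nucleus: pitch descends after this mora
    "accent_break": "#",  # boundary between accent phrases
    "pause": "_",         # short pause
    "question": "?",      # interrogative sentence ending
}

def interleave(phonemes, prosody_events):
    """Merge a phoneme list with (position, event) prosodic annotations
    into one flat symbol sequence for the seq2seq encoder input."""
    events_at = {}
    for pos, event in prosody_events:
        events_at.setdefault(pos, []).append(PROSODIC[event])
    sequence = []
    for i, phoneme in enumerate(phonemes):
        sequence.extend(events_at.get(i, []))  # symbols before this phoneme
        sequence.append(phoneme)
    sequence.extend(events_at.get(len(phonemes), []))  # trailing symbols
    return sequence

# Hypothetical example: a four-phoneme word with an accent rise, an accent
# nucleus on the third phoneme, and a trailing pause.
symbols = interleave(["h", "a", "sh", "i"],
                     [(1, "accent_rise"), (3, "accent_fall"), (4, "pause")])
# symbols == ["h", "[", "a", "sh", "]", "i", "_"]
```

Because the prosodic marks live in the same symbol vocabulary as the phonemes, no architectural change to the attention-based encoder is needed; the model simply learns embeddings for the extra tokens.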

Cited by 21 publications (8 citation statements). References 17 publications.
“…The accent nucleus position is the mora just before the pitch descends in the accent phrase. These features are not explicitly written in Japanese raw text; however, they are important for prosodic naturalness in Japanese TTS systems [3,4]. AP consists of two parts: APBP and ANPP.…”
Section: Accent Prediction (AP)
confidence: 99%
“…We stopped training when the learning rate fell below 10⁻⁴. As implicit features, the BERT-base model and the Flair model, which are pre-trained on Japanese Wikipedia, were used. When BERT was used as an implicit feature, the last four layers were concatenated.…”
Section: Implicit Features
confidence: 99%
“…We followed the recipe in egs2/jsut/tts1, using 7,196 utterances for training, 250 for validation, and 250 for evaluation. We used the G2P function based on Open JTalk enhanced with prosody symbols [46] for all models. We compared the following architectures: Tacotron 2 and Tacotron 2 + HiFi-GAN.…”
Section: Japanese Single Speaker
confidence: 99%
“…In [1], data augmentation is applied to extend the voice range in terms of F0 and duration, and note embeddings are used in parallel to the phoneme sequence, to pursue singing synthesis. [17] inserts prosodic symbols into the phoneme sequence to model accents, pauses, and sentence endings.…”
Section: Related Work
confidence: 99%