2021
DOI: 10.1007/978-3-030-87802-3_11
Improved Prosodic Clustering for Multispeaker and Speaker-Independent Phoneme-Level Prosody Control

Cited by 2 publications (3 citation statements: 0 supporting, 3 mentioning, 0 contrasting)
References 26 publications
“…The learned labels provide great controllability in synthesized speech; however, they are bounded by the speaker's range, since the outermost clusters may contain extreme values which are infrequent in the training data. The proposed method applies directly to multispeaker TTS and enables phoneme-level prosody control for every speaker included in the training set [47]. We also introduce a prosody predictor module to enable end-to-end TTS without the need for reference audio or manually selected labels.…”
Section: Proposed Methods (mentioning)
Confidence: 99%
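The statement above describes learned discrete prosody labels obtained by clustering per-phoneme acoustic values, with the outermost clusters absorbing a speaker's rare extreme values. Below is a minimal sketch of that clustering idea, not the authors' code: per-phoneme mean F0 values are grouped with k-means and the sorted centroids become ordered prosody labels. All function and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def learn_prosody_labels(phoneme_f0, n_clusters=5, seed=0):
    """phoneme_f0: 1-D array of per-phoneme mean F0 values (Hz)."""
    f0 = np.asarray(phoneme_f0, dtype=np.float64).reshape(-1, 1)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(f0)
    # Sort centroids so label 0 = lowest pitch and K-1 = highest; the
    # outermost labels cover the speaker's extreme (infrequent) values.
    order = np.argsort(km.cluster_centers_.ravel())
    relabel = np.empty(n_clusters, dtype=int)
    relabel[order] = np.arange(n_clusters)
    labels = relabel[km.labels_]          # ordered discrete prosody label per phoneme
    centroids = km.cluster_centers_.ravel()[order]
    return labels, centroids

# Example: assign discrete prosody labels to six per-phoneme F0 values.
labels, centroids = learn_prosody_labels([95, 110, 180, 120, 240, 130], n_clusters=3)
print(labels, centroids)
```

At synthesis time a label sequence like this can be selected manually for control, or, as the statement notes, predicted by a prosody predictor module so no reference audio is needed.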
“…The acoustic model is based on our previous work [55,46], adapted to a multispeaker architecture [47]. On the decoder side, the attention RNN produces a hidden state h_i which is used as a query in the attention mechanism for calculating the context vector c_i representing phoneme information.…”
Section: Acoustic Model Architecture (mentioning)
Confidence: 99%
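To make the query/context relationship concrete, here is a minimal sketch assuming simple dot-product attention; the cited work's exact scoring function may differ. The decoder hidden state h_i is the query, the encoder's phoneme states are scored against it, and the softmax-weighted sum gives the context vector c_i. Shapes and names are illustrative.

```python
import numpy as np

def attention_context(h_i, encoder_states):
    """h_i: (d,) query; encoder_states: (T, d) phoneme encodings."""
    scores = encoder_states @ h_i            # (T,) alignment energies
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()                 # attention weights over phonemes
    c_i = weights @ encoder_states           # (d,) context vector
    return c_i, weights

rng = np.random.default_rng(0)
c, a = attention_context(rng.normal(size=8), rng.normal(size=(12, 8)))
print(a.sum(), c.shape)  # weights sum to 1.0; context has the encoder dimension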
“…The M-U model [8] offers an alternative option for fine-tuning on speech data, but its input pitch values are quantized, allowing only limited control, and its vocoder is trained on singing data. Our previous work [9] explores singing-data-free training by combining a TTS prosody control model [10] with a post-processing DSP module, resulting in high-quality melodic voice generation but with limited pitch variation.…”
Section: Related Work (mentioning)
Confidence: 99%
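The limitation attributed to quantized input pitch can be illustrated with a small sketch, assuming quantization to the nearest equal-tempered semitone (the statement does not specify the grid, so the reference pitch and step size here are assumptions): nearby continuous F0 values collapse onto the same quantized pitch, which is what restricts fine control.

```python
import numpy as np

A4 = 440.0  # assumed reference pitch in Hz

def quantize_f0_to_semitones(f0_hz):
    f0 = np.asarray(f0_hz, dtype=np.float64)
    semis = 12.0 * np.log2(f0 / A4)              # continuous semitone offset from A4
    return A4 * 2.0 ** (np.round(semis) / 12.0)  # snap to the nearest semitone

print(quantize_f0_to_semitones([217.0, 220.0, 223.0]))
# all three inputs collapse onto the same quantized pitch, A3 = 220 Hz
```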