Improved Prosodic Clustering for Multispeaker and Speaker-Independent Phoneme-Level Prosody Control

Christidou, Myrsini; Vioni, Alexandra; Ellinas, Nikolaos; Vamvoukakis, Georgios; Markopoulos, Konstantinos; Kakoulidis, Panos; Sung, June Sig; Park, Hyoung-Min; Chalamandaris, Aimilios; Tsiakoulis, Pirros

doi:10.1007/978-3-030-87802-3_11

Cited by 2 publications

(3 citation statements)

References 26 publications

“…The learned labels provide great controllability in synthesized speech, however they are bounded by the speaker's range, since the outermost clusters may contain extreme values which are not frequent in the training data. The proposed method is directly applied to multispeaker TTS and enables phoneme-level prosody control for every speaker included in the training set [47]. We also introduce a prosody predictor module to enable end-to-end TTS without the need of reference audio or manually selected labels.…”

Section: Proposed Methods (mentioning)

confidence: 99%

Controllable speech synthesis by learning discrete phoneme-level prosodic representations

Ellinas, Christidou, Vioni et al. 2023

Speech Communication

“…The acoustic model is based on our previous work [55,46] adapted to a multispeaker architecture [47]. On the decoder side, the attention RNN produces a hidden state h_i which is used as a query in the attention mechanism for calculating the context vector c_i representing phoneme information.…”

Section: Acoustic Model Architecture (mentioning)

confidence: 99%

Controllable speech synthesis by learning discrete phoneme-level prosodic representations

Ellinas, Christidou, Vioni et al. 2023

Speech Communication

“…M-U model [8] offers an alternative option for finetuning on speech data but input pitch values are quantized allowing limited control and the vocoder is trained on singing data. Our previous work [9] explores singing-data-free training by combining a TTS prosody control model [10] with a post-processing DSP module, resulting to a melodic voice generation of high quality but with limited pitch variation.…”

Section: Related Work (mentioning)

confidence: 99%

Karaoker: Alignment-free singing voice synthesis with speech training data

Kakoulidis, Ellinas, Vamvoukakis et al. 2022

Preprint (self-citation)

Existing singing voice synthesis models (SVS) are usually trained on singing data and depend on either error-prone time-alignment and duration features or explicit music score information. In this paper, we propose Karaoker, a multispeaker Tacotron-based model conditioned on voice characteristic features that is trained exclusively on spoken data without requiring time-alignments. Karaoker synthesizes singing voice following a multi-dimensional template extracted from a source waveform of an unseen speaker/singer. The model is jointly conditioned with a single deep convolutional encoder on continuous data including pitch, intensity, harmonicity, formants, cepstral peak prominence and octaves. We extend the text-to-speech training objective with feature reconstruction, classification and speaker identification tasks that guide the model to an accurate result. Except for multi-tasking, we also employ a Wasserstein GAN training scheme as well as new losses on the acoustic model's output to further refine the quality of the model.
