The use of phonological features instead of phonemes as input to sequence-to-sequence TTS has recently been proposed for zero-shot multilingual speech synthesis. This approach is useful for code-switching, as it facilitates the seamless synthesis of foreign text embedded in a stream of native text. In our work, we train a language-agnostic multispeaker model conditioned on a set of phonologically derived features common across different languages, with the goal of achieving cross-lingual speaker adaptation. We first investigate the effect of phonological similarity between languages on cross-lingual TTS for several source-target language combinations. Subsequently, we fine-tune the model with very limited data from a new speaker's voice in either a seen or an unseen language, and achieve synthetic speech of equal quality while preserving the target speaker's identity. With as few as 32 and 8 utterances of target speaker data, we obtain high speaker similarity scores and naturalness comparable to those reported in the corresponding literature. In the extreme case of only 2 available adaptation utterances, we find that our model behaves as a few-shot learner, as the performance is similar in both the seen and unseen adaptation language scenarios.