Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech

Huang, Sung-Feng; Lin, Chyi-Jiunn; Liu, Da-Rong; Chen, Yi-Chen; Lee, Hung-yi

doi:10.48550/arxiv.2111.04040

Cited by 3 publications

(3 citation statements)

References 27 publications


“…Previous work (Ba et al, 2016) found that layer normalization could greatly influence the hidden activation and final prediction with a light-weight learnable scale vector $\gamma$ and bias vector $\beta$: $\mathrm{LN}(x) = \gamma \frac{x - \mu}{\sigma} + \beta$, where $\mu$ and $\sigma$ are the mean and variance of hidden vector $x$. (Huang et al, 2021; Chen et al, 2020a) further proposed conditional layer normalization for speaker adaptation $\mathrm{CLN}(x, w) = \gamma(w) \frac{x - \mu}{\sigma} + \beta(w)$, which can adaptively perform scaling and shifting of the normalized input features based on the style embedding. Here two simple linear layers $E^{\gamma}$ and $E^{\delta}$ take style embedding $w$ as input and output the scale and bias vector respectively:…”

Section: Mix-style Layer Normalization (mentioning)

confidence: 99%
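The conditional layer normalization described in the citation statement above can be illustrated with a minimal NumPy sketch. It is not the implementation from Meta-TTS, AdaSpeech, or GenerSpeech. The function names, the parameter names (`W_gamma`, `b_gamma`, `W_beta`, `b_beta`), and the tensor shapes are hypothetical. The sketch also assumes the usual layer-normalization convention of dividing by the standard deviation (the square root of the variance plus a small `eps`), whereas the quoted text calls $\sigma$ the variance.

```python
import numpy as np


def layer_norm(x, gamma, beta, eps=1e-5):
    """Plain layer norm over the last axis: gamma * (x - mu) / sigma + beta."""
    mu = x.mean(axis=-1, keepdims=True)
    # Assumption: sigma is the standard deviation, stabilised with eps.
    sigma = np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    return gamma * (x - mu) / sigma + beta


def conditional_layer_norm(x, w, W_gamma, b_gamma, W_beta, b_beta, eps=1e-5):
    """Layer norm whose scale and bias are predicted from a style embedding w.

    x: hidden features, shape (T, d)
    w: style or speaker embedding, shape (s,)
    W_gamma, W_beta: hypothetical linear-layer weights, shape (s, d)
    b_gamma, b_beta: hypothetical linear-layer biases, shape (d,)
    """
    gamma = w @ W_gamma + b_gamma  # scale vector gamma(w), shape (d,)
    beta = w @ W_beta + b_beta     # bias vector beta(w), shape (d,)
    return layer_norm(x, gamma, beta, eps)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(4, 8))    # 4 frames, 8 hidden units
    w = rng.normal(size=(3,))      # 3-dimensional style embedding
    W_gamma = rng.normal(size=(3, 8))
    W_beta = rng.normal(size=(3, 8))
    out = conditional_layer_norm(x, w, W_gamma, np.ones(8), W_beta, np.zeros(8))
    print(out.shape)
```

With zero weight matrices, a bias of ones for the scale, and a bias of zeros for the shift, the conditional form reduces to plain layer normalization. Under this sketch, only the two small linear layers carry speaker-dependent information.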


GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

Huang, Ren, Liu et al. 2022

Preprint


Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples with unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic reference, while facing the following challenges: 1) The highly dynamic style features in expressive voice are difficult to model and transfer; and 2) the TTS models should be robust enough to handle diverse OOD conditions that differ from the source data. This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice. GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components: 1) a multi-level style adaptor to efficiently model a large range of style conditions, including global speaker and emotion characteristics, and the local (utterance, phoneme, and word-level) fine-grained prosodic representations; and 2) a generalizable content adaptor with Mix-Style Layer Normalization to eliminate style information in the linguistic content representation and thus improve model generalization. Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity. The extension studies to adaptive style transfer further show that GenerSpeech performs robustly in the few-shot data setting. Audio samples are available at https://GenerSpeech.github.io/.


“…AdaSpeech (Chen et al, 2020a) adapts new voice by finetuning on the limited adaptation data with diverse acoustic conditions. Several works (Min et al, 2021; Huang et al, 2021) adopt meta-learning to adapt to new speakers that have not been seen during training.…”

Section: Introduction (mentioning)

confidence: 99%

GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech

Huang, Ren, Liu et al. 2022

Preprint


“…There are many ways of adapting a multispeaker model to a new speaker, for example fine-tuning [10,11] is a standard approach that uses the target speaker's data to continue training of the base model. In [12] a multi-stage speaker adaptation method is also proposed, whereas in [13] meta-learning is used in order to increase the generalization capability of the model. Adaptation is also shown to work effectively in multilingual setups [14,15].…”

Section: Related Work (mentioning)

confidence: 99%

Self supervised learning for robust voice cloning

Klapsas, Ellinas, Nikitaras et al. 2022

Preprint


Voice cloning is a difficult task which requires robust and informative features incorporated in a high quality TTS system in order to effectively copy an unseen speaker's voice. In our work, we utilize features learned in a self-supervised framework via the Bootstrap Your Own Latent (BYOL) method, which is shown to produce high quality speech representations when specific audio augmentations are applied to the vanilla algorithm. We further extend the augmentations in the training procedure to aid the resulting features to capture the speaker identity and to make them robust to noise and acoustic conditions. The learned features are used as pre-trained utterance-level embeddings and as inputs to a Non-Attentive Tacotron based architecture, aiming to achieve multispeaker speech synthesis without utilizing additional speaker features. This method enables us to train our model in an unlabeled multispeaker dataset as well as use unseen speaker embeddings to copy a speaker's voice. Subjective and objective evaluations are used to validate the proposed model, as well as the robustness to the acoustic conditions of the target utterance.
