2021
DOI: 10.48550/arxiv.2111.04040
Preprint
Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech

Sung-Feng Huang,
Chyi-Jiunn Lin,
Da-Rong Liu
et al.

Abstract: Personalizing a speech synthesis system is a highly desired application, where the system can generate speech in the user's voice from only a few enrolled recordings. Recent works take two main approaches to building such a system: speaker adaptation and speaker encoding. On the one hand, speaker adaptation methods fine-tune a trained multi-speaker text-to-speech (TTS) model with the few enrolled samples. However, they require at least thousands of fine-tuning steps for high-quality adaptation, making it hard to a…


Cited by 3 publications (3 citation statements)
References 27 publications
“…Previous work (Ba et al., 2016) found that layer normalization can greatly influence the hidden activations and final prediction through a light-weight learnable scale vector γ and bias vector β: LN(x) = γ · (x − µ)/σ + β, where µ and σ are the mean and standard deviation of the hidden vector x. (Huang et al., 2021; Chen et al., 2020a) further proposed conditional layer normalization for speaker adaptation, CLN(x, w) = γ(w) · (x − µ)/σ + β(w), which adaptively scales and shifts the normalized input features based on the style embedding. Here two simple linear layers E_γ and E_δ take the style embedding w as input and output the scale and bias vectors respectively:…”
Section: Mix-Style Layer Normalization
confidence: 99%
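The conditional layer normalization (CLN) described in the quoted statement can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the weight matrices standing in for the linear layers E_γ and E_δ, and all dimensions, are made-up toy values.

```python
import numpy as np

def conditional_layer_norm(x, w, W_gamma, b_gamma, W_beta, b_beta, eps=1e-5):
    """CLN(x, w) = gamma(w) * (x - mu) / sigma + beta(w).

    gamma(w) and beta(w) are produced by two linear layers (here plain
    weight matrices plus biases, standing in for E_gamma and E_delta)
    applied to the style embedding w.
    """
    mu = x.mean(axis=-1, keepdims=True)       # per-vector mean
    sigma = x.std(axis=-1, keepdims=True)     # per-vector std deviation
    gamma = w @ W_gamma + b_gamma             # style-conditioned scale
    beta = w @ W_beta + b_beta                # style-conditioned bias
    return gamma * (x - mu) / (sigma + eps) + beta

# Toy sizes: hidden dim 4, style-embedding dim 3 (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))                   # two hidden vectors
w = rng.normal(size=(3,))                     # one style embedding
out = conditional_layer_norm(
    x, w,
    rng.normal(size=(3, 4)), np.ones(4),      # E_gamma weights / bias
    rng.normal(size=(3, 4)), np.zeros(4),     # E_delta weights / bias
)
```

Because γ and β are functions of w rather than fixed parameters, swapping in a new speaker's style embedding re-scales and re-shifts the normalized features without touching the backbone weights.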
“…AdaSpeech (Chen et al., 2020a) adapts to a new voice by fine-tuning on the limited adaptation data with diverse acoustic conditions. Several works (Min et al., 2021; Huang et al., 2021) adopt meta-learning to adapt to new speakers that have not been seen during training.…”
Section: Introduction
confidence: 99%
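The meta-learning recipe these works apply can be illustrated with a first-order MAML-style loop on a toy problem. This is a hedged sketch only: the 1-D regression "tasks" stand in for speakers, and none of the numbers or functions come from the cited papers.

```python
import numpy as np

def loss_grad(theta, a, xs):
    """Squared error and its gradient for predicting y = a*x with y_hat = theta*x."""
    err = theta * xs - a * xs
    return np.mean(err ** 2), np.mean(2 * err * xs)

inner_lr, outer_lr = 0.1, 0.05
theta = 0.0                                   # meta-learned initialization
xs = np.array([1.0, 2.0, 3.0])                # shared inputs
tasks = [0.5, 1.5, 2.5]                       # per-"speaker" slopes (toy stand-ins)

for _ in range(200):
    meta_grad = 0.0
    for a in tasks:
        _, g = loss_grad(theta, a, xs)
        theta_adapted = theta - inner_lr * g  # inner step: adapt to one "speaker"
        _, g_post = loss_grad(theta_adapted, a, xs)
        # first-order approximation: outer gradient = post-adaptation gradient
        meta_grad += g_post
    theta -= outer_lr * meta_grad / len(tasks)
```

The outer loop pushes the initialization toward a point from which a single inner gradient step fits any of the training tasks well, which is the property these works exploit to adapt quickly to unseen speakers.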
“…There are many ways of adapting a multi-speaker model to a new speaker; for example, fine-tuning [10, 11] is a standard approach that uses the target speaker's data to continue training the base model. In [12] a multi-stage speaker adaptation method is also proposed, whereas in [13] meta-learning is used to increase the generalization capability of the model. Adaptation is also shown to work effectively in multilingual setups [14, 15].…”
Section: Related Work
confidence: 99%