Sample Efficient Adaptive Text-to-Speech

Chen, Yutian; Shillingford, Brendan; Budden, David; Reed, Scott; Zen, Heiga; Wang, Quan; Cobo, Luis C.; Trask, Andrew; Laurie, Ben; Gülçehre, Çağlar; Oord, Aäron van den; Vinyals, Oriol; Freitas, Nando de

doi:10.48550/arxiv.1809.10460

Cited by 32 publications

(57 citation statements)

References 26 publications


“…SEA-TTS [9] uses Wavenet [1] as its basic architecture and compares the two voice cloning approaches. As mentioned by [8] that adapting the whole model might result in overfitting, SEA-TTS proposed two techniques to deal with it.…”

Section: A Experimental Results

mentioning

confidence: 99%

“…There are two general approaches to deal with such task [8], speaker adaptation [8], [9], [10], [11], [12] and speaker encoding [6], [13], [7], [14], [15], [16], [17]. The speaker encoding method builds a multi-speaker TTS architecture which consists of a speaker encoder and a TTS model.…”

mentioning

confidence: 99%

“…The speaker encoding method builds a multi-speaker TTS architecture which consists of a speaker encoder and a TTS model. The speaker encoder could be pre-trained [6] or jointly trained [8], [9] with the TTS model. In order to clone the voice of an unseen speaker, the speaker encoder extracts the speaker's embedding from a few speech samples.…”

mentioning

confidence: 99%

“…However, the speaker encoder might suffer from generalization problems and adapt worse for unseen speakers. On the other hand, the multi-speaker TTS architecture includes a speaker embedding table instead of a speaker encoder for the speaker adaptation method [8], [9], [11]. The speaker embedding table is jointly trained with the TTS model.…”

mentioning

confidence: 99%

“…When learning an unseen speaker's voice, we would first randomly initialize an embedding for the new speaker, and then the embedding would be fine-tuned by the speech samples of the unseen speaker alone or with the TTS model. As experimented in [8], [9], although the speaker adaptation approach performs better than the speaker encoding approach, it requires thousands of adaptation steps, which means more cloning time and computational resources are needed for high-quality voice cloning.…”

mentioning

confidence: 99%
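The adaptation procedure described in the statement above (randomly initialize an embedding for the new speaker, then fine-tune it on that speaker's samples) can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the linear stand-in model, the names `predict`, `init_embedding`, and `adapt_embedding`, the weights, and the learning rate are hypothetical and do not come from SEA-TTS or Meta-TTS. It covers only the embedding-only variant, in which the shared model weights stay frozen.

```python
import random


def predict(x, emb, w_text, w_spk):
    """Toy frozen model: linear in the text features x and the speaker embedding."""
    text_part = sum(w * v for w, v in zip(w_text, x))
    speaker_part = sum(w * v for w, v in zip(w_spk, emb))
    return text_part + speaker_part


def mse(data, emb, w_text, w_spk):
    """Mean squared error of the toy model on (features, target) pairs."""
    return sum((predict(x, emb, w_text, w_spk) - y) ** 2 for x, y in data) / len(data)


def init_embedding(dim, seed=0):
    """Random initial embedding for an unseen speaker."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 0.1) for _ in range(dim)]


def adapt_embedding(emb, data, w_text, w_spk, steps=200, lr=0.05):
    """Gradient descent on the embedding only; w_text and w_spk are never updated."""
    emb = list(emb)
    for _ in range(steps):
        grad = [0.0] * len(emb)
        for x, y in data:
            err = predict(x, emb, w_text, w_spk) - y
            for i in range(len(emb)):
                # d/d emb[i] of the mean squared error
                grad[i] += 2.0 * err * w_spk[i] / len(data)
        emb = [e - lr * g for e, g in zip(emb, grad)]
    return emb


# Hypothetical frozen weights and a hidden "true" embedding for the new speaker.
w_text = [0.5, -0.3]
w_spk = [1.0, 2.0]
true_emb = [0.7, -0.2]

# A few adaptation samples generated from the hidden embedding.
rng = random.Random(1)
features = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(8)]
data = [(x, predict(x, true_emb, w_text, w_spk)) for x in features]

start_emb = init_embedding(dim=2)
initial_loss = mse(data, start_emb, w_text, w_spk)
final_emb = adapt_embedding(start_emb, data, w_text, w_spk)
final_loss = mse(data, final_emb, w_text, w_spk)
```

In this sketch the loss on the new speaker's samples drops while the shared weights are left untouched. The quoted statement also mentions the alternative of fine-tuning the embedding together with the TTS model, which is not shown here.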

Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech

Huang, Lin, Liu, et al. 2021

Preprint

Personalizing a speech synthesis system is a highly desired application, where the system can generate speech with the user's voice with rare enrolled recordings. There are two main approaches to build such a system in recent works: speaker adaptation and speaker encoding. On the one hand, speaker adaptation methods fine-tune a trained multi-speaker text-to-speech (TTS) model with few enrolled samples. However, they require at least thousands of fine-tuning steps for high-quality adaptation, making it hard to apply on devices. On the other hand, speaker encoding methods encode enrollment utterances into a speaker embedding. The trained TTS model can synthesize the user's speech conditioned on the corresponding speaker embedding. Nevertheless, the speaker encoder suffers from the generalization gap between the seen and unseen speakers. In this paper, we propose applying a meta-learning algorithm to the speaker adaptation method. More specifically, we use Model Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model, which aims to find a great meta-initialization to adapt the model to any few-shot speaker adaptation tasks quickly. Therefore, we can also adapt the meta-trained TTS model to unseen speakers efficiently. Our experiments compare the proposed method (Meta-TTS) with two baselines: a speaker adaptation method baseline and a speaker encoding method baseline. The evaluation results show that Meta-TTS can synthesize high speaker-similarity speech from few enrollment samples with fewer adaptation steps than the speaker adaptation baseline and outperforms the speaker encoding baseline under the same training scheme. When the speaker encoder of the baseline is pre-trained with extra 8371 speakers of data, Meta-TTS can still outperform the baseline on LibriTTS dataset and achieve comparable results on VCTK dataset.
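The idea of a meta-initialization that adapts in few steps can be made concrete with a toy sketch of MAML on a scalar regression problem. This is not the Meta-TTS implementation: the model `y = theta * x`, the task slopes, the learning rates, and the names `grad_and_loss` and `maml_train` are all assumptions chosen so that the second-order term can be written out by hand.

```python
def grad_and_loss(theta, xs, slope):
    """MSE loss and its gradient for the toy model y = theta * x on one task."""
    n = len(xs)
    loss = sum((theta * x - slope * x) ** 2 for x in xs) / n
    grad = sum(2.0 * (theta * x - slope * x) * x for x in xs) / n
    return loss, grad


def maml_train(task_slopes, support_xs, query_xs,
               inner_lr=0.1, outer_lr=0.1, meta_steps=500):
    """Learn an initial theta that performs well after one inner gradient step."""
    theta = 0.0
    # Mean of x^2 on the support set; the loss is quadratic in theta, so the
    # derivative of the adapted parameter w.r.t. theta is a constant.
    m_support = sum(x * x for x in support_xs) / len(support_xs)
    d_adapted_d_theta = 1.0 - 2.0 * inner_lr * m_support
    for _ in range(meta_steps):
        meta_grad = 0.0
        for slope in task_slopes:
            # Inner loop: one adaptation step on the task's support set.
            _, g_support = grad_and_loss(theta, support_xs, slope)
            adapted = theta - inner_lr * g_support
            # Outer loss: evaluate the adapted parameter on the query set.
            _, g_query = grad_and_loss(adapted, query_xs, slope)
            # Chain rule through the inner update (the second-order MAML term).
            meta_grad += g_query * d_adapted_d_theta
        theta -= outer_lr * meta_grad / len(task_slopes)
    return theta


# Hypothetical training tasks ("seen speakers") and shared input points.
support_xs = [0.5, 1.0, 1.5]
query_xs = [0.8, 1.2]
meta_init = maml_train([1.0, 2.0, 3.0], support_xs, query_xs)

# Adapt to an unseen task with a single inner step from the meta-initialization.
unseen_slope = 2.5
loss_before, _ = grad_and_loss(meta_init, query_xs, unseen_slope)
_, g = grad_and_loss(meta_init, support_xs, unseen_slope)
adapted_theta = meta_init - 0.1 * g
loss_after, _ = grad_and_loss(adapted_theta, query_xs, unseen_slope)
```

In this toy setting the meta-initialization settles near the middle of the training task slopes, so a single gradient step already reduces the loss on the unseen task. A real TTS model has no closed-form second-order term and would rely on automatic differentiation instead.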


Conventional and contemporary approaches used in text to speech synthesis: a review

Kaur, Singh 2022

Artif Intell Rev

Low Bit-rate Speech Coding with VQ-VAE and a WaveNet Decoder

Gârbacea, Oord, et al. 2019

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Self Cite

In order to efficiently transmit and store speech signals, speech codecs create a minimally redundant representation of the input signal which is then decoded at the receiver with the best possible perceptual quality. In this work we demonstrate that a neural network architecture based on VQ-VAE with a WaveNet decoder can be used to perform very low bit-rate speech coding with high reconstruction quality. A prosody-transparent and speaker-independent model trained on the LibriSpeech corpus coding audio at 1.6 kbps exhibits perceptual quality which is around halfway between the MELP codec at 2.4 kbps and AMR-WB codec at 23.05 kbps. In addition, when training on high-quality recorded speech with the test speaker included in the training set, a model coding speech at 1.6 kbps produces output of similar perceptual quality to that generated by AMR-WB at 23.05 kbps.
