Interspeech 2022
DOI: 10.21437/interspeech.2022-10115
Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks

Abstract: Transfer tasks in text-to-speech (TTS) synthesis, in which one or more aspects of the speech of one set of speakers are transferred to another set of speakers that do not originally feature these aspects, remain challenging. One of the challenges is that models with high-quality transfer capabilities can have stability issues, making them impractical for user-facing critical tasks. This paper demonstrates that transfer can be obtained by training a robust TTS system on data generated by a less robu…
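The abstract's two-stage recipe can be sketched as follows: an expressive but less stable "teacher" TTS with accent-transfer capability synthesizes a corpus in the target accent, and a robust "student" TTS is then trained on that synthetic corpus. All function names and data shapes below are illustrative placeholders, not the authors' actual APIs.

```python
def synthesize_transfer_corpus(teacher_tts, texts, target_accent):
    """Use the teacher TTS to produce accent-transferred (text, audio) pairs."""
    return [(text, teacher_tts(text, target_accent)) for text in texts]

def train_robust_student(corpus):
    """Stand-in for training a stable student TTS on the synthetic corpus.
    A real implementation would fit an acoustic model plus vocoder here."""
    return {"num_examples": len(corpus)}

# Dummy teacher: returns a tagged placeholder string instead of real audio.
dummy_teacher = lambda text, accent: f"<{accent}-audio for: {text}>"

corpus = synthesize_transfer_corpus(
    dummy_teacher, ["hello world", "good morning"], target_accent="en-AU"
)
student = train_robust_student(corpus)
```

The key design point the paper argues for is that the student never sees the teacher's instabilities at inference time: only the (filtered) synthetic audio survives into training.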

Cited by 5 publications (3 citation statements); references 18 publications.
“…This synthetic corpus along with real corpus is then used to train a non-auto-regressive TTS system. A similar approach utilizing synthetic corpus from existing TTS is explored in (Finkelstein et al., 2022; Song et al., 2022). Our work is similar to these approaches where the common aspect is to generate synthetic audio from another TTS system.…”
Section: Related Work
confidence: 99%
“…The popular text-to-spectrogram models include Tacotron2, Transformer-TTS (Li et al., 2019), FastSpeech2 (Ren et al., 2020), FastPitch (Łańcucki, 2021), and Glow-TTS. In terms of voice quality the Tacotron2 model is still competitive with other models and less prone to over-fitting in low-resource settings (Favaro et al., 2021; Abdelali et al., 2022; García et al., 2022; Finkelstein et al., 2022). There are multiple options for the vocoder as well, like ClariNet (Ping et al., 2018), WaveGlow (Prenger et al., 2019), MelGAN (Kumar et al., 2019), HiFiGAN, StyleMelGAN (Mustafa et al., 2021), and ParallelWaveGAN (Yamamoto et al., 2020).…”
Section: Introduction
confidence: 99%
“…(1) Parallel corpus of different accents of the same speaker, using source and target speech content and time alignment (Finkelstein et al., 2022; Liu et al., 2022; Hida et al., 2022; Toda et al., 2007; Oyamada et al., 2017). (2) Non-parallel corpus of multiple speakers with multiple accents, using inconsistent source and target speech content (Wang et al., 2021; Zhao et al., 2018, 2019; Kaneko et al., 2019, 2020a, 2021). Finkelstein et al. (2022) used a multi-stage trained TTS model to achieve transfer of North American, Australian, and British accents, and used a CHiVE-BERT pre-training model to enhance the audio quality of accent generation. Liu et al. (2022) added an accent variance adaptor to model the rhythmicity of accent variance, and also enhanced the accent-generation audio by using a consistency constraint module.…”
Section: Introduction
confidence: 99%