Interspeech 2022
DOI: 10.21437/interspeech.2022-10115
Training Text-To-Speech Systems From Synthetic Data: A Practical Approach For Accent Transfer Tasks

Abstract: Transfer tasks in text-to-speech (TTS) synthesis, in which one or more aspects of the speech of one set of speakers are transferred to another set of speakers that do not originally feature these aspects, remain challenging. One of the challenges is that models with high-quality transfer capabilities can have stability issues, making them impractical for user-facing critical tasks. This paper demonstrates that transfer can be obtained by training a robust TTS system on data generated by a less robu…
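The abstract's two-stage recipe can be sketched as follows: an expressive but less stable "teacher" TTS with accent-transfer capability synthesizes a corpus in the target accent, and a robust "student" TTS is then trained on that synthetic corpus. All function names and data shapes below are illustrative placeholders, not the authors' actual APIs.

```python
def synthesize_transfer_corpus(teacher_tts, texts, target_accent):
    """Use the teacher TTS to produce accent-transferred (text, audio) pairs."""
    return [(text, teacher_tts(text, target_accent)) for text in texts]

def train_robust_student(corpus):
    """Stand-in for training a stable student TTS on the synthetic corpus.
    A real implementation would fit an acoustic model plus vocoder here."""
    return {"num_examples": len(corpus)}

# Dummy teacher: returns a tagged placeholder string instead of real audio.
dummy_teacher = lambda text, accent: f"<{accent}-audio for: {text}>"

corpus = synthesize_transfer_corpus(
    dummy_teacher, ["hello world", "good morning"], target_accent="en-AU"
)
student = train_robust_student(corpus)
```

The key design point the paper argues for is that the student never sees the teacher's instabilities at inference time: only the (filtered) synthetic audio survives into training.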

Cited by 5 publications (3 citation statements); references 18 publications.
“…This synthetic corpus along with real corpus is then used to train a non-auto-regressive TTS system. A similar approach utilizing synthetic corpus from existing TTS is explored in (Finkelstein et al., 2022; Song et al., 2022). Our work is similar to these approaches where the common aspect is to generate synthetic audio from another TTS system.…”
Section: Related Work
confidence: 99%
“…The popular text-to-spectrogram models include Tacotron2, Transformer-TTS (Li et al., 2019), FastSpeech2 (Ren et al., 2020), FastPitch (Łańcucki, 2021), and Glow-TTS. In terms of voice quality the Tacotron2 model is still competitive with other models and less prone to over-fitting in low-resource settings (Favaro et al., 2021; Abdelali et al., 2022; García et al., 2022; Finkelstein et al., 2022). There are multiple options for the vocoder as well, like ClariNet (Ping et al., 2018), WaveGlow (Prenger et al., 2019), MelGAN (Kumar et al., 2019), HiFiGAN, StyleMelGAN (Mustafa et al., 2021), and ParallelWaveGAN (Yamamoto et al., 2020).…”
Section: Introduction
confidence: 99%
“…(1) Parallel corpus of different accents of the same speaker, using source and target speech content and time alignment (Finkelstein et al., 2022; Liu et al., 2022; Hida et al., 2022; Toda et al., 2007; Oyamada et al., 2017). (2) Non-parallel corpus of multiple speakers with multiple accents, using inconsistent source and target speech content (Wang et al., 2021; Zhao et al., 2018, 2019; Kaneko et al., 2019, 2020a, 2021). Finkelstein et al. (2022) used a multi-stage trained TTS model to achieve transfer of North American, Australian, and British accents, and used a CHiVE-BERT pre-training model to enhance the audio quality of accent generation. Liu et al. (2022) added an accent variance adaptor to model the rhythmicity of accent variance, and also enhanced the accent-generation audio by using a consistency constraint module.…”
Section: Introduction
confidence: 99%