Interspeech 2022
DOI: 10.21437/interspeech.2022-225

Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus


Cited by 14 publications (4 citation statements). References 0 publications.
“…The learned hidden state is then projected back to the output dimension of the original VITS text encoder to replace a part of the text encoder. Building upon this, we also observed the work of replacing the text encoder of the VITS model with a pseudo-phoneme [33] encoder. The specific process involves using wav2vec 2.0 to process the waveform, indexing, clustering, and merging the resulting hidden states to obtain representations of pseudo-phonemes.…”
Section: Methods
Mentioning confidence: 91%
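The pseudo-phoneme procedure quoted above (wav2vec 2.0 features, clustering, and merging of repeated indices) can be sketched as follows. This is a minimal illustration, assuming the HuggingFace transformers Wav2Vec2 model and a scikit-learn KMeans quantizer; the checkpoint name and cluster count are illustrative assumptions, not the cited papers' exact settings.

```python
# Sketch of pseudo-phoneme extraction: frame-level wav2vec 2.0 hidden states
# are indexed with a fitted KMeans quantizer, then consecutive repeats merged.
import numpy as np
import torch
from sklearn.cluster import KMeans
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

def pseudo_phonemes(waveform: np.ndarray, sr: int, kmeans: KMeans) -> list[int]:
    """Index each wav2vec 2.0 frame with a cluster id, then merge repeats."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        hidden = model(inputs.input_values).last_hidden_state[0]  # (frames, dim)
    labels = kmeans.predict(hidden.numpy())          # cluster index per frame
    merged = [int(labels[0])]
    for label in labels[1:]:                         # merge consecutive repeats
        if label != merged[-1]:
            merged.append(int(label))
    return merged

# The KMeans quantizer would be fit offline on hidden states pooled from the
# unlabeled corpus, e.g. kmeans = KMeans(n_clusters=128).fit(pooled_states).
```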
“…Kim et al. simplified the TTS pipeline by dividing it into semantic and acoustic modeling stages, reducing training complexity [23]. Trini TTS [24] and NSV-TTS [25] focused on pitch-controllable models and self-supervised learning to extract unsupervised linguistic units, respectively.…”
Section: Advances in Model Architectures
Mentioning confidence: 99%
“…In terms of generation quality, single- and multi-speaker TTS models can synthesize human-like voices with sufficient training data from the target speaker(s) [1][2][3][4][5]. Further, several few- or zero-shot multi-speaker TTS models have recently been developed to synthesize out-of-domain (OOD) speech with limited data from the target speaker [6][7][8][9][10][11]. These models are trained using a large multi-speaker dataset to learn a general TTS mapping relationship conditioned on speaker representations.…”
Section: Introduction
Mentioning confidence: 99%
“…Especially, zero-shot multi-speaker TTS models [8][9][10][11] are widely being studied due to their unique advantage of not requiring any training data from the target speaker. A common approach of these models is to extract the speaker representations from reference speech using a reference encoder [7,12,13].…”
Section: Introduction
Mentioning confidence: 99%
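The reference-encoder approach quoted above, extracting a speaker representation from reference speech, can be sketched as a small module that pools a reference mel-spectrogram into a fixed-size speaker embedding. The layer sizes below are illustrative assumptions, not the configuration of any specific cited model.

```python
# Sketch of a reference encoder: convolutions over mel channels followed by a
# GRU whose final hidden state serves as the speaker embedding.
import torch
import torch.nn as nn

class ReferenceEncoder(nn.Module):
    def __init__(self, n_mels: int = 80, embed_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.gru = nn.GRU(256, embed_dim, batch_first=True)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames) -> speaker embedding: (batch, embed_dim)
        x = self.conv(mel).transpose(1, 2)  # (batch, frames, 256)
        _, h = self.gru(x)                  # final hidden state summarizes the clip
        return h.squeeze(0)

# Example: a reference clip of 400 mel frames yields one speaker embedding.
# emb = ReferenceEncoder()(torch.randn(1, 80, 400))  # shape (1, 256)
```

Zero-shot multi-speaker TTS models then condition synthesis on this embedding at inference time, so no training data from the target speaker is required.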