Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus

Kim, Minchan; Jeong, Myeong‐Hun; Choi, Byoung‐Joo; Ahn, S. H.; Lee, Joun Yeop; Kim, Nam Soo

doi:10.21437/interspeech.2022-225

Cited by 14 publications

(4 citation statements)

References 0 publications


“…The learned hidden state is then projected back to the output dimension of the original VITS text encoder to replace a part of the text encoder. Building upon this, we also observed the work of replacing the text encoder of the VITS model with a pseudo-phoneme [33] encoder. The specific process involves using wav2vec 2.0 to process the waveform, indexing, clustering, and merging the resulting hidden states to obtain representations of pseudo-phonemes.…”

Section: Methods (mentioning)

confidence: 91%
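
The pseudo-phoneme step quoted above (cluster the hidden states, index them, merge repeats) can be illustrated with a minimal sketch. It makes several assumptions that are not in the quoted text: the centroids are already available (for example from k-means), the toy 2-D vectors stand in for real wav2vec 2.0 hidden states, and "merging" means collapsing consecutive identical cluster indices. The function names are hypothetical.

```python
from itertools import groupby


def nearest_centroid(frame, centroids):
    """Return the index of the centroid closest to `frame` (squared Euclidean distance)."""
    def sq_dist(centroid):
        return sum((x - y) ** 2 for x, y in zip(frame, centroid))
    return min(range(len(centroids)), key=lambda i: sq_dist(centroids[i]))


def pseudo_phonemes(frames, centroids):
    """Index each frame by its nearest cluster, then merge consecutive repeats."""
    indices = [nearest_centroid(f, centroids) for f in frames]
    # groupby yields one group per run of equal neighbouring indices.
    return [idx for idx, _ in groupby(indices)]


# Toy 2-D vectors standing in for frame-level hidden states (hypothetical values).
centroids = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
frames = [(0.1, 0.0), (0.0, 0.1), (0.9, 1.0), (1.0, 0.9), (0.1, 0.9), (0.1, 0.1)]

units = pseudo_phonemes(frames, centroids)
print(units)
```

In a real pipeline the frames would come from a pretrained speech model and the centroids from a clustering run over an unlabeled corpus; neither is shown here.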


BERTIVITS: The Posterior Encoder Fusion of Pre-Trained Models and Residual Skip Connections for End-to-End Speech Synthesis

Wang, Song, Zhou (2024), Applied Sciences

Enhancing the naturalness and rhythmicity of generated audio in end-to-end speech synthesis is crucial. The current state-of-the-art (SOTA) model, VITS, utilizes a conditional variational autoencoder architecture. However, it faces challenges, such as limited robustness, due to training solely on text and spectrum data from the training set. Particularly, the posterior encoder struggles with mid- and high-frequency feature extraction, impacting waveform reconstruction. Existing efforts mainly focus on prior encoder enhancements or alignment algorithms, neglecting improvements to spectrum feature extraction. In response, we propose BERTIVITS, a novel model integrating BERT into VITS. Our model features a redesigned posterior encoder with residual connections and utilizes pre-trained models to enhance spectrum feature extraction. Compared to VITS, BERTIVITS shows significant subjective MOS score improvements (0.16 in English, 0.36 in Chinese) and objective Mel-Cepstral coefficient reductions (0.52 in English, 0.49 in Chinese). BERTIVITS is tailored for single-speaker scenarios, improving speech synthesis technology for applications like post-class tutoring or telephone customer service.



“…Kim et al. simplified the TTS pipeline by dividing it into semantic and acoustic modeling stages, reducing training complexity [23]. Trini TTS [24] and NSV-TTS [25] focused on pitch-controllable models and self-supervised learning to extract unsupervised linguistic units, respectively.…”

Section: Advances In Model Architectures (mentioning)

confidence: 99%

BERTIVITS: The Posterior Encoder Fusion of Pre-Trained Models and Residual Skip Connections for End-to-End Speech Synthesis

Wang, Song, Zhou (2024), Applied Sciences

“…In terms of generation quality, single- and multi-speaker TTS models can synthesize human-like voices with sufficient training data from the target speaker(s) [1][2][3][4][5]. Further, several few- or zero-shot multi-speaker TTS models have recently been developed to synthesize out-of-domain (OOD) speech with limited data from the target speaker [6][7][8][9][10][11]. These models are trained using a large multi-speaker dataset to learn a general TTS mapping relationship conditioned on speaker representations.…”

Section: Introduction (mentioning)

confidence: 99%

“…Especially, zero-shot multi-speaker TTS models [8][9][10][11] are widely being studied due to their unique advantage of not requiring any training data from the target speaker. A common approach of these models is to extract the speaker representations from reference speech using a reference encoder [7,12,13].…”

Section: Introduction (mentioning)

confidence: 99%

Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech

Yoon, Kim, Song et al. (2023), Interspeech 2023

For personalized speech generation, a neural text-to-speech (TTS) model must be successfully implemented with limited data from a target speaker. To this end, the baseline TTS model needs to be amply generalized to out-of-domain data (i.e., target speaker's speech). However, approaches to address this out-of-domain generalization problem in TTS have yet to be thoroughly studied. In this work, we propose an effective pruning method for a transformer known as sparse attention, to improve the TTS model's generalization abilities. In particular, we prune off redundant connections from self-attention layers whose attention weights are below the threshold. To flexibly determine the pruning strength for searching optimal degree of generalization, we also propose a new differentiable pruning method that allows the model to automatically learn the thresholds. Evaluations on zero-shot multi-speaker TTS verify the effectiveness of our method in terms of voice quality and speaker similarity.
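
The threshold-based pruning described in this abstract can be illustrated for a single attention row. This is only a sketch under stated assumptions, not the authors' implementation: the threshold is a fixed constant rather than a learned one, the surviving weights are renormalized to sum to 1 (the abstract does not say whether this is done), and the function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one row of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def prune_attention(weights, threshold):
    """Zero out attention weights below `threshold`.

    Renormalizing the survivors is an assumption made for this sketch.
    """
    kept = [w if w >= threshold else 0.0 for w in weights]
    total = sum(kept)
    if total == 0.0:
        # Everything fell below the threshold: fall back to the unpruned row.
        return list(weights)
    return [w / total for w in kept]


scores = [2.0, 1.0, -1.0, -3.0]  # toy attention scores for one query
weights = softmax(scores)
pruned = prune_attention(weights, threshold=0.05)
print([round(w, 3) for w in pruned])
```

The differentiable variant mentioned in the abstract, in which the thresholds are learned during training, is not covered by this sketch.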
