2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru46091.2019.9004008

Bootstrapping Non-Parallel Voice Conversion from Speaker-Adaptive Text-to-Speech

Abstract: Voice conversion (VC) and text-to-speech (TTS) are two tasks that share a similar objective: generating speech with a target voice. However, they are usually developed independently under vastly different frameworks. In this paper, we propose a methodology to bootstrap a VC system from a pretrained speaker-adaptive TTS model and unify the techniques as well as the interpretations of these two tasks. Moreover, by offloading the heavy data demand to the training stage of the TTS model, our VC system can be built …

Cited by 18 publications (22 citation statements) · References 52 publications
“…Another line of work shows that training an integrated system capable of performing either TTS or VC can boost individual performance [29], [54]. Bootstrapping from a speaker-adaptive or multi-speaker TTS model is another active direction [55], [56]. However, few of the above-mentioned methods were designed to tackle the data deficiency problem for seq2seq-based parallel, one-to-one VC, which is the main scope of this paper.…”
Section: Transfer Learning from ASR and TTS for VC
Citation type: mentioning, confidence: 99%
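
The bootstrapping direction this statement refers to can be illustrated with a minimal PyTorch-style sketch: a pretrained speaker-adaptive TTS model exposes a speaker-conditioned decoder, and a VC system is obtained by swapping the text encoder for a speech encoder that maps source speech into the same linguistic space. All class and module names below are hypothetical simplifications, not the actual architecture of [55] or [56].

```python
# Minimal sketch: reuse the speaker-conditioned decoder of a pretrained TTS
# model for VC. Module names and sizes are illustrative placeholders.
import torch
import torch.nn as nn

class SpeakerAdaptiveTTS(nn.Module):
    def __init__(self, vocab_size, n_speakers, d=256, n_mels=80):
        super().__init__()
        self.text_encoder = nn.Embedding(vocab_size, d)    # stands in for a full text encoder
        self.speaker_emb = nn.Embedding(n_speakers, d)     # learned speaker codes
        self.decoder = nn.GRU(2 * d, d, batch_first=True)  # speaker-conditioned decoder
        self.mel_out = nn.Linear(d, n_mels)

    def decode(self, linguistic, speaker_id):
        # Broadcast the speaker code over time and decode to mel frames.
        spk = self.speaker_emb(speaker_id)[:, None, :].expand(-1, linguistic.size(1), -1)
        h, _ = self.decoder(torch.cat([linguistic, spk], dim=-1))
        return self.mel_out(h)

    def forward(self, text_ids, speaker_id):               # TTS path
        return self.decode(self.text_encoder(text_ids), speaker_id)

class BootstrappedVC(nn.Module):
    """VC path: a speech encoder maps source mels into the decoder's
    linguistic space; its training (e.g. to mimic the text encoder's
    outputs) is omitted here for brevity."""
    def __init__(self, tts: SpeakerAdaptiveTTS, n_mels=80, d=256):
        super().__init__()
        self.speech_encoder = nn.GRU(n_mels, d, batch_first=True)
        self.tts = tts
        for p in self.tts.parameters():                    # keep the pretrained TTS fixed
            p.requires_grad = False

    def forward(self, source_mels, target_speaker_id):
        linguistic, _ = self.speech_encoder(source_mels)
        return self.tts.decode(linguistic, target_speaker_id)
```

The point of the sketch is the division of labor: the data-hungry decoder is trained once as part of the multi-speaker TTS model, and only the comparatively small speech encoder must be learned for conversion.
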
“…The text transcripts are only required during training. Zhang et al. [46] and Luong et al. [45] proposed joint training of TTS and VC by sharing a common decoder. Park et al. [44] proposed using the context vectors of the Tacotron system as a speaker-independent linguistic representation to guide voice conversion.…”
Section: Leveraging Knowledge from Speech Synthesis
Citation type: mentioning, confidence: 99%
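
The shared-decoder joint training attributed to Zhang et al. and Luong et al. can be hedged into a short training-step sketch, reusing the hypothetical SpeakerAdaptiveTTS class above (with its parameters trainable here). To sidestep the attention alignment a real seq2seq system would need, the sketch assumes the phoneme inputs are already duration-expanded and frame-aligned to the mel targets; this is a simplifying assumption, not the cited systems' mechanism.

```python
import torch.nn.functional as F

def joint_tts_vc_step(tts, speech_encoder, batch, optimizer):
    """One joint step: the TTS branch (text -> decoder) and the VC branch
    (speech -> same decoder) share the speaker-conditioned decoder, and
    both reconstruction losses update it together."""
    # Hypothetical batch: frame-aligned phoneme ids, target mels, speaker id.
    phone_ids, mels, speaker_id = batch
    tts_pred = tts.decode(tts.text_encoder(phone_ids), speaker_id)
    linguistic, _ = speech_encoder(mels)          # e.g. nn.GRU(n_mels, d, batch_first=True)
    vc_pred = tts.decode(linguistic, speaker_id)  # auto-encode through the shared decoder
    loss = F.l1_loss(tts_pred, mels) + F.l1_loss(vc_pred, mels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
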
“…The VC encoder seeks to generate speaker-independent linguistic features from input spectral features, while the VC decoder reconstructs the mel-spectrum features from the linguistic features, conditioning on a speaker code. Studies show that voice conversion benefits from explicit phonetic modeling that ensures adherence to linguistic content during conversion [44], [45].…”
Section: B. Transfer Learning from TTS to VC
Citation type: mentioning, confidence: 99%
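
A minimal sketch of the encoder/decoder split described in this statement. As a stand-in for the explicit phonetic modeling of [44], [45], the encoder here is pushed toward speaker-independent features by a frame-level phoneme classification loss, while the decoder conditions on a learned speaker code; names, sizes, and the choice of loss are illustrative assumptions.

```python
# Sketch: VC encoder/decoder with a phonetic auxiliary loss (an assumed
# stand-in for explicit phonetic modeling) and a speaker-code-conditioned
# decoder. All names and dimensions are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PhoneticVC(nn.Module):
    def __init__(self, n_mels=80, n_phones=72, n_speakers=100, d=256):
        super().__init__()
        self.encoder = nn.GRU(n_mels, d, batch_first=True)
        self.phone_head = nn.Linear(d, n_phones)   # auxiliary head tying features to phonetic content
        self.speaker_emb = nn.Embedding(n_speakers, d)
        self.decoder = nn.GRU(2 * d, d, batch_first=True)
        self.mel_head = nn.Linear(d, n_mels)

    def forward(self, mels, speaker_id):
        ling, _ = self.encoder(mels)                                   # (B, T, d) linguistic features
        spk = self.speaker_emb(speaker_id)[:, None, :].expand_as(ling)
        out, _ = self.decoder(torch.cat([ling, spk], dim=-1))
        return self.mel_head(out), self.phone_head(ling)

def loss_fn(model, mels, speaker_id, phone_targets):
    # Reconstruction keeps the decoder faithful to the spectra; the
    # cross-entropy term keeps the encoder features phonetic.
    recon, phone_logits = model(mels, speaker_id)
    return (F.l1_loss(recon, mels) +
            F.cross_entropy(phone_logits.transpose(1, 2), phone_targets))
```

At conversion time the same encoder runs on the source speaker's mels while the decoder receives the target speaker's code, which is what makes the speaker-independence of the encoder features matter.
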