ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp39728.2021.9413959

Crank: An Open-Source Software for Nonparallel Voice Conversion Based on Vector-Quantized Variational Autoencoder

Abstract: In this paper, we present an open-source software for developing a nonparallel voice conversion (VC) system named crank. Although we have released an open-source VC software based on the Gaussian mixture model named sprocket in the last VC Challenge, it is not straightforward to apply any speech corpus because it is necessary to prepare parallel utterances of source and target speakers to model a statistical conversion function. To address this issue, in this study, we developed a new open-source VC software t…

Cited by 14 publications (9 citation statements) · References 25 publications
“…PPGs act as speaker-invariant representations that can be easily utilized to achieve conversion. The other main approach is auto-encoding-style training, utilizing both variational autoencoder (VAE) [3,4,5,6,7,8] and vector-quantized VAE (VQ-VAE) [9,10,11,12] approaches. The auto-encoding-style training approaches try to disentangle speaker information from the content of the source speech, using various methods from bottlenecking to adversarial training to prevent speaker leakage into the content or speech encoder.…”
Section: Related Work
confidence: 99%
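The vector-quantization bottleneck at the heart of the VQ-VAE approaches mentioned above can be illustrated with a minimal NumPy sketch (a toy nearest-neighbour lookup, not crank's actual implementation; the `quantize` helper and all shapes are hypothetical):

```python
import numpy as np

def quantize(z, codebook):
    """Map each continuous latent frame to its nearest codebook entry.

    z:        (T, D) encoder outputs, one row per frame
    codebook: (K, D) learned code vectors
    Returns the quantized latents (T, D) and the chosen indices (T,).
    """
    # Squared Euclidean distance from every frame to every code vector.
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)      # nearest code per frame
    return codebook[idx], idx

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))  # K=8 codes of dimension D=4
z = rng.normal(size=(10, 4))        # 10 latent frames
zq, idx = quantize(z, codebook)
```

Because every frame is snapped to one of a small, fixed set of codes, the bottleneck discards fine speaker detail and encourages the latents to carry mostly content, which is the disentanglement effect the quoted passage describes.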
“…For the VAE model, we used crank [15], an open-source VC software that combines recent advances in autoencoder-based VC methods, including the use of hierarchical architectures, cyclic loss and adversarial training. To take full advantage of unsupervised learning, we trained the network using not only the data of the patient and the reference speakers but also a multi-speaker TTS dataset.…”
Section: Nonparallel Frame-wise Model
confidence: 99%
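The cyclic loss mentioned above can be sketched as a round-trip reconstruction penalty (a toy illustration under assumed identity-plus-offset encoder/decoder stand-ins; `cycle_loss` and the speaker offsets are hypothetical and not crank's API):

```python
import numpy as np

def cycle_loss(x, encode, decode, src_id, tgt_id):
    """Round-trip reconstruction penalty: convert source -> target speaker,
    convert back to the source speaker, and measure the mean squared error."""
    x_tgt = decode(encode(x), tgt_id)      # forward conversion
    x_cyc = decode(encode(x_tgt), src_id)  # conversion back to source
    return float(((x - x_cyc) ** 2).mean())

# Toy stand-ins: the "latent" is the mean-removed signal (content),
# and each speaker is just a scalar offset (identity).
speaker_offset = {"src": 0.5, "tgt": -0.5}
encode = lambda x: x - x.mean()
decode = lambda z, s: z + speaker_offset[s]

content = np.linspace(-1.0, 1.0, 20)  # zero-mean "content"
x = content + speaker_offset["src"]   # source-speaker utterance
loss = cycle_loss(x, encode, decode, "src", "tgt")
```

In this toy, encoding perfectly strips the speaker offset, so the round trip reconstructs the input and the loss is near zero; in a real nonparallel system the cyclic term is what supplies a training signal when no parallel target utterance exists.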
“…The ability of seq2seq VC models to convert suprasegmental information and the parallel training strategy can greatly improve the naturalness and intelligibility, though the speaker identity is changed into that of the reference speaker. Next, a frame-wise, nonparallel VC model realized by a variational autoencoder (VAE) [13,14,15] takes the converted speech with the identity of the reference speaker as input and restores the identity of the patient. An important assumption we make here is that due to the frame-wise constraint, the VAE model changes only time-invariant characteristics such as the speaker identity, while preserving time-variant characteristics, such as pronunciation.…”
Section: Introduction
confidence: 99%
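The assumption that a frame-wise model changes only time-invariant characteristics can be illustrated with a toy sketch that swaps the utterance-level mean (a crude stand-in for speaker identity) while leaving frame-to-frame variation untouched; `framewise_convert` and the feature shapes are hypothetical:

```python
import numpy as np

def framewise_convert(src_feats, tgt_mean):
    """Toy frame-wise conversion: replace the utterance-level mean
    (a stand-in for time-invariant speaker identity) while leaving
    the frame-to-frame variation (time-variant content) intact.

    src_feats: (T, D) source acoustic features
    tgt_mean:  (D,)   target speaker's feature mean
    """
    content = src_feats - src_feats.mean(axis=0)  # remove speaker-level statistics
    return content + tgt_mean                     # re-impose target statistics

rng = np.random.default_rng(1)
src = rng.normal(loc=2.0, size=(50, 3))  # source utterance, 50 frames
tgt_mean = np.full(3, -1.0)              # hypothetical target-speaker mean
out = framewise_convert(src, tgt_mean)
```

The converted features take on the target's global statistics while each frame's deviation from the mean, here standing in for pronunciation, is preserved exactly, mirroring the assumption stated in the quoted passage.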
“…Previous research on multi-speaker TTS still requires hundreds or thousands of high-quality training utterances per speaker [15]- [17]. Several studies utilize voice conversion [18]- [20] to augment both the speaker and speech databases, addressing the extensive training data requirement. However, recent voice conversion methods still struggle to synthesize noise-free speech.…”
Section: Introduction
confidence: 99%