2021
DOI: 10.48550/arxiv.2106.08873
Preprint

Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments

Abstract: Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker. While there is a rich literature on VC, most proposed methods are trained and evaluated on clean speech recordings. However, many acoustic environments are noisy and reverberant, severely restricting the applicability of popular VC methods to such scenarios. To address this limitation, we propose Voicy, a new VC framework particularly tailored for noi…

Cited by 1 publication (1 citation statement)
References 24 publications
“…Besides, an intuitive way is to leverage a speech enhancement module to remove noise before training [19], but it will inevitably affect the quality of generated speech because the extra speech distortion after speech enhancement will propagate to the acoustic model as well as the vocoder [20]. Motivated by denoising auto-encoder [21], denoising training strategy is also applied in several robust VC systems [22,19]. It has been reported that the denoising method leads to worse naturalness than the speech enhancement method, while the speech enhancement-based method has lower speaker similarity scores than the denoising approach [19].…”
Section: Introduction (mentioning)
confidence: 99%