Building a voice conversion system for noisy target speakers, such as users providing noisy samples or data found on the Internet, is a challenging task, since using contaminated speech in model training inevitably degrades conversion performance. In this paper, we leverage the advances of our recently proposed Glow-WaveGAN [1] and propose a noise-independent speech representation learning approach for high-quality voice conversion for noisy target speakers. Specifically, we learn a latent feature space in which the target distribution modeled by the conversion model is guaranteed to match the distribution modeled by the waveform generator. On this premise, we further make the latent features noise-invariant. Concretely, we introduce a noise-controllable WaveGAN whose encoder learns the noise-independent acoustic representation directly from the waveform, and whose decoder conducts noise control in the hidden space through a FiLM [2] module. As for the conversion model, importantly, we use a flow-based model to learn the distribution of noise-independent but speaker-related latent features from phoneme posteriorgrams. Experimental results demonstrate that the proposed model achieves high speech quality and speaker similarity in voice conversion for noisy target speakers.
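To illustrate the FiLM-based noise control mentioned above, the following is a minimal sketch of feature-wise linear modulation in PyTorch, in which decoder hidden features are scaled and shifted channel-wise by a conditioning vector (here assumed to be a noise embedding). The module name, dimensions, and the noise embedding are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation [2]: modulates hidden features
    channel-wise with a scale (gamma) and shift (beta) predicted from
    an external conditioning vector (assumed: a noise embedding)."""
    def __init__(self, feature_dim: int, cond_dim: int):
        super().__init__()
        self.to_gamma = nn.Linear(cond_dim, feature_dim)
        self.to_beta = nn.Linear(cond_dim, feature_dim)

    def forward(self, h: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # h:    (batch, channels, time) decoder hidden features
        # cond: (batch, cond_dim) conditioning vector, e.g. noise condition
        gamma = self.to_gamma(cond).unsqueeze(-1)  # (batch, channels, 1)
        beta = self.to_beta(cond).unsqueeze(-1)    # (batch, channels, 1)
        return gamma * h + beta

# Hypothetical usage: modulate decoder states with a noise condition.
film = FiLM(feature_dim=256, cond_dim=64)
hidden = torch.randn(2, 256, 100)   # assumed decoder hidden states
noise_emb = torch.randn(2, 64)      # assumed noise condition embedding
modulated = film(hidden, noise_emb)
```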