Cross-lingual voice conversion (XVC) transforms the speaker identity of a source speaker to that of a target speaker who speaks a different language. Due to the intrinsic differences between languages, the converted speech may carry an unwanted foreign accent. In this paper, we first investigate the intelligibility of the converted speech and confirm the performance degradation caused by the accent/intelligibility issue. With the goal of generating native-sounding speech, we further propose a novel training scheme with two additional linguistic losses for speech waveform generation: 1) a frame-wise phonetic content loss derived from bottleneck features, and 2) an automatic speech recognition loss on characters. Experiments were conducted on conversions between English and Mandarin Chinese. The experimental results confirm that the generated speech sounds more natural with the proposed linguistic losses and that the proposed solution significantly improves speech intelligibility.
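The abstract does not give implementation details, so the following is only a minimal sketch of how the two linguistic losses could be combined with a waveform-generation objective. It assumes PyTorch, a pretrained bottleneck-feature extractor (bn_extractor), a pretrained character-level ASR model (asr_model), and placeholder loss weights; none of these names or choices come from the paper.

    import torch
    import torch.nn.functional as F

    def linguistic_losses(converted_wav, reference_wav, char_targets, target_lens,
                          bn_extractor, asr_model, w_bn=1.0, w_asr=0.1):
        """Illustrative combination of the two linguistic losses (assumed form)."""
        # 1) Frame-wise phonetic content loss on bottleneck features,
        #    assuming both waveforms yield the same number of frames.
        bn_converted = bn_extractor(converted_wav)   # (B, T, D)
        bn_reference = bn_extractor(reference_wav)   # (B, T, D)
        loss_bn = F.l1_loss(bn_converted, bn_reference)

        # 2) ASR loss on characters: CTC between the ASR log-probabilities of
        #    the converted speech and the ground-truth character sequence.
        log_probs = asr_model(converted_wav)         # (T, B, num_chars), log-softmaxed
        input_lens = torch.full((log_probs.size(1),), log_probs.size(0), dtype=torch.long)
        loss_asr = F.ctc_loss(log_probs, char_targets, input_lens, target_lens)

        # Weighted sum to be added to the ordinary waveform-generation loss.
        return w_bn * loss_bn + w_asr * loss_asr

In practice this term would be added to the generator's existing training loss; the weights shown here are arbitrary placeholders.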
Accent Conversion (AC) seeks to change the accent of speech from a source accent to a target accent while preserving the speech content and speaker identity. However, many existing AC approaches rely on source-target parallel speech data during training or on reference speech at run-time. We propose a novel accent conversion framework that requires neither parallel data nor reference speech. Specifically, a text-to-speech (TTS) system is first pretrained with target-accented speech data, so that this TTS model and its hidden representations are associated only with the target accent. A speech encoder is then trained to convert the accent of the speech under the supervision of the pretrained TTS model: the source-accented speech and its corresponding transcription are forwarded to the speech encoder and the pretrained TTS, respectively, and the output of the speech encoder is optimized to match the text embedding in the TTS system. At run-time, the speech encoder is combined with the pretrained speech decoder to convert the source-accented speech toward the target. In the experiments, we converted English speech with two source accents (Chinese/Indian) toward three target accents (American/British/Canadian). Both objective metrics and subjective listening tests validate that the proposed approach generates speech samples that are close to the target accent with high speech quality.
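As an illustration only, the sketch below shows the embedding-matching step described above in PyTorch-style pseudocode. The names speech_encoder and tts_text_encoder, the L1 matching objective, and the omission of any speech-to-text alignment are assumptions for the example, not details from the paper.

    import torch
    import torch.nn.functional as F

    def train_step(speech_encoder, tts_text_encoder, optimizer,
                   source_accented_speech, transcription_tokens):
        # Target: hidden text representation from the TTS pretrained on
        # target-accented data (kept frozen so it stays tied to the target accent).
        with torch.no_grad():
            text_emb = tts_text_encoder(transcription_tokens)   # (B, T, D)

        # Prediction: speech encoder applied to the source-accented utterance.
        speech_emb = speech_encoder(source_accented_speech)     # (B, T, D)

        # Push the speech-encoder output toward the TTS text embedding,
        # assuming the two sequences are already aligned to the same length.
        loss = F.l1_loss(speech_emb, text_emb)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

At run-time, speech_encoder would replace the TTS text encoder and feed the pretrained speech decoder, so the generated speech follows the target accent learned during TTS pretraining.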