2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
DOI: 10.1109/asru46091.2019.9003798

A Modularized Neural Network with Language-Specific Output Layers for Cross-Lingual Voice Conversion

Abstract: This paper presents a cross-lingual voice conversion framework that adopts a modularized neural network. The modularized neural network has a common input structure that is shared by both languages, and two separate output modules, one for each language. The idea is motivated by the fact that the phonetic systems of languages are similar because humans share a common vocal production system, but acoustic renderings, such as prosody and phonotactics, vary a lot from language to language. The modularized neural netw…
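As a rough illustration of the architecture the abstract describes, the sketch below pairs a shared hidden section with one output head per language. This is a minimal sketch, not the paper's exact configuration: the layer sizes, the 80-dimensional frame features, and the class name ModularizedVC are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class ModularizedVC(nn.Module):
    """Shared hidden section + one language-specific output head per language."""

    def __init__(self, in_dim=80, hidden_dim=256, out_dim=80, num_languages=2):
        super().__init__()
        # Shared hidden section: models structure common to both languages.
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Language-specific output section: one head per language.
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, out_dim) for _ in range(num_languages)]
        )

    def forward(self, x, lang_id):
        h = self.shared(x)             # common representation
        return self.heads[lang_id](h)  # route to the head of the utterance's language


# Usage: frames of a language-1 utterance go through head 0, language-2 through head 1.
model = ModularizedVC()
frames_lang1 = torch.randn(100, 80)    # 100 frames of 80-dim features (assumed)
converted = model(frames_lang1, lang_id=0)
```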

Cited by 12 publications (8 citation statements) | References 52 publications
“…It provides a linguistic representation to characterize the two phonetic language systems in XVC tasks [37]. Moreover, model training with language-specific layers is another of our previous studies on XVC, where the conversion model has a hidden section and an output section [46], [48]. The hidden section is shared between the two languages, while the output section has two language-specific heads, one for each language [48].…”
Section: A BNF-to-Waveform XVC Framework
confidence: 99%
“…Moreover, model training with language-specific layers is another of our previous studies on XVC, where the conversion model has a hidden section and an output section [46], [48]. The hidden section is shared between the two languages, while the output section has two language-specific heads, one for each language [48]. The shared layers serve as the bridge between the two languages, while the language-specific layers take care of the acoustic renderings individually.…”
Section: A BNF-to-Waveform XVC Framework
confidence: 99%
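A hedged sketch of how such a shared/language-specific split might be trained: mini-batches from the two languages are interleaved, so the shared section is updated by both languages while each output head is updated only by its own language's data. It reuses the ModularizedVC class from the sketch above; the frame-level L1 loss and the random stand-in tensors are assumptions for illustration.

```python
import torch
import torch.nn as nn

model = ModularizedVC()                      # class from the sketch above
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.L1Loss()                      # frame-level L1 loss (an assumption)


def train_step(src_frames, tgt_frames, lang_id):
    optimizer.zero_grad()
    pred = model(src_frames, lang_id)        # shared layers + the language's own head
    loss = criterion(pred, tgt_frames)
    loss.backward()                          # gradients always reach the shared section,
    optimizer.step()                         # but only heads[lang_id] among the heads
    return loss.item()


# Interleave mini-batches from the two languages so the shared section is trained
# on both; random tensors stand in for real source/target frame pairs.
for _ in range(10):
    train_step(torch.randn(32, 80), torch.randn(32, 80), lang_id=0)
    train_step(torch.randn(32, 80), torch.randn(32, 80), lang_id=1)
```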
“…Our work is mostly inspired by the cross-lingual conversion based on a modularized neural network [21], accent conversion using an accented ASR to learn accent-agnostic linguistic representations [20], and the above adversarial learning approaches [19,24,25]. In this work, we particularly focus on the joint many-to-many voice and accent conversion task, where the source voice of an arbitrary speaker can be converted to a target speaker (from a target set) with a specified accent; more challengingly, the target speaker has no training data for this specified accent.…”
Section: Related Work
confidence: 99%
“…Specifically, we adopt a well-trained speech recognizer to first transform the source speaker's voice into bottleneck features. Inspired by [20,21], we use accent-dependent speech recognizers to obtain BN features for speakers with different accents. This aims to further disentangle the linguistic information in the BN features from other factors, including accent, for conversion model training.…”
Section: Introduction
confidence: 99%
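As a sketch of the bottleneck-feature (BN) front end that statement describes, the snippet below takes the intermediate activations of an accent-dependent recognizer as the linguistic representation fed to a conversion model. TinyASREncoder is a hypothetical stand-in for a recognizer pretrained on speech of the matching accent, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn


class TinyASREncoder(nn.Module):
    """Stand-in for an accent-dependent ASR acoustic model with a bottleneck layer."""

    def __init__(self, in_dim=80, hidden=256, bn_dim=64, num_phones=100):
        super().__init__()
        self.front = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.bottleneck = nn.Linear(hidden, bn_dim)      # narrow "BN" layer
        self.classifier = nn.Linear(bn_dim, num_phones)  # phone targets used in ASR training

    def forward(self, mel_frames):
        bn = self.bottleneck(self.front(mel_frames))
        return bn, self.classifier(bn)


def extract_bn_features(encoder, mel_frames):
    # Keep only the bottleneck activations: largely linguistic content, with much
    # of the speaker/accent detail stripped away.
    with torch.no_grad():
        bn, _ = encoder(mel_frames)
    return bn


# One recognizer per accent, so each speaker's BN features come from a model
# matched to that speaker's accent.
encoder_accent_a, encoder_accent_b = TinyASREncoder(), TinyASREncoder()
bn_a = extract_bn_features(encoder_accent_a, torch.randn(120, 80))
```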
“…Specifically, Phonetic PosteriorGram (PPG) based models are often used for the cross-lingual scenario [9]. Even though a monolingual PPG trained on the target language is good enough for cross-lingual VC, it was reported that a bilingual PPG [20,19] or mixed-lingual PPG [21] can significantly improve the performance. Non-parallel VC systems based on a Variational Autoencoder (VAE) [22] or Generative Adversarial Network (GAN) [23,24] are also applicable to the cross-lingual scenario.…”
Section: Introduction
confidence: 99%
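To make the PPG idea concrete, the sketch below computes per-frame posteriors over phonetic classes from acoustic-model logits and forms a bilingual PPG by concatenating two monolingual PPGs. The phone-set sizes and the random logits are placeholders, not values taken from the cited systems.

```python
import torch
import torch.nn.functional as F


def ppg_from_logits(logits):
    """Frame-level phone posteriors: softmax over the phonetic classes."""
    return F.softmax(logits, dim=-1)


frames = 120
en_logits = torch.randn(frames, 40)   # ~40 English phone classes (assumed size)
cn_logits = torch.randn(frames, 60)   # ~60 Mandarin phonetic units (assumed size)

en_ppg = ppg_from_logits(en_logits)                                   # monolingual PPG
bilingual_ppg = torch.cat([en_ppg, ppg_from_logits(cn_logits)], -1)   # bilingual PPG
```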