End-to-end Code-switched TTS with Mix of Monolingual Recordings

Cao, Yuewen; Wu, Xixin; Liu, Songxiang; Yu, Jianwei; Liu, Xü; Wu, Zhiyong; Liu, Xunying; Meng, Helen

doi:10.1109/icassp.2019.8682927

Cited by 30 publications

(16 citation statements)

References 11 publications

“…Different from Tacotron, we use the duration from the ASR model for frame expansion instead of attention-based soft alignment. To represent and control the speaker identity and accent, we separately add accent embedding into the encoder and speaker embedding into the decoder [26]. To wipe out speaker-related information in the encoder output, we add an auxiliary speaker classifier after the encoder and adversarial training strategy is adopted, which will be introduced in detail in Section 3.3.…”

Section: Overview and Model Architecture (mentioning)

confidence: 99%

Accent and Speaker Disentanglement in Many-to-many Voice Conversion

Wang et al. 2020

Preprint

This paper proposes an interesting voice and accent joint conversion approach, which can convert an arbitrary source speaker's voice to a target speaker with non-native accent. This problem is challenging as each target speaker only has training data in native accent and we need to disentangle accent and speaker information in the conversion model training and re-combine them in the conversion stage. In our recognition-synthesis conversion framework, we manage to solve this problem by two proposed tricks. First, we use accent-dependent speech recognizers to obtain bottleneck features for different accented speakers. This aims to wipe out other factors beyond the linguistic information in the BN features for conversion model training. Second, we propose to use adversarial training to better disentangle the speaker and accent information in our encoder-decoder based conversion model. Specifically, we plug an auxiliary speaker classifier to the encoder, trained with an adversarial loss to wipe out speaker information from the encoder output. Experiments show that our approach is superior to the baseline. The proposed tricks are quite effective in improving accentedness and audio quality and speaker similarity are well maintained.

“…Therefore, our approach can be seen as analogous to these works. Our work is closest to [20,21] in that we use monolingual recordings. However, we explicitly work in the latent prior space while [20] operate at the level of encoding individual languages and [21] begin with an average voice and refine it using phoneme informed attention.…”

Section: Synthesis Of Code Mixed Text (mentioning)

confidence: 99%

Variational Attention Using Articulatory Priors for Generating Code Mixed Speech Using Monolingual Corpora

Rallabandi, Black 2019

Interspeech 2019

Code mixing, a phenomenon where lexical items from one language are embedded in the utterance of another, is relatively frequent in multilingual communities, and therefore speech systems should be able to process such content. However, building a voice capable of synthesizing such content typically requires bilingual recordings from the speaker, which might not always be easy to obtain. In this work, we present an approach for building mixed-lingual systems using only monolingual corpora. Specifically, we present a way to train a multi-speaker text-to-speech system by incorporating stochastic latent variables into the attention mechanism with the objective of synthesizing code-mixed content. We subject the prior distribution for such latent variables to match articulatory constraints. Subjective evaluation shows that our systems are capable of generating high-quality synthesis in code-mixed scenarios.

“…In parallel with the ZS-TTS, multilingual TTS has also evolved aiming at learning models for multiple languages at the same time [14,15,16,17]. Some of these models are particularly interesting as they allow for code-switching, i.e.…”

Section: Introduction (mentioning)

confidence: 99%

YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone

Casanova, Weber, Shulby et al. 2021

Preprint

YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with a single-speaker dataset, opening possibilities for zero-shot multi-speaker TTS and zero-shot voice conversion systems in low-resource languages. Finally, it is possible to fine-tune the YourTTS model with less than 1 minute of speech and achieve state-of-the-art results in voice similarity and with reasonable quality. This is important to allow synthesis for speakers with a very different voice or recording characteristics from those seen during training.
