ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053094
Code-Switched Speech Synthesis Using Bilingual Phonetic Posteriorgram with Only Monolingual Corpora

Cited by 16 publications (13 citation statements)
References 16 publications
“…Maiti et al. [11] approximated the domain shift by computing the distance between speaker embeddings of the source and target language of a bilingual speaker, and used this distance to map speaker embeddings from the source-language space to the target-language space. Cao et al. [14] used bilingual phonetic posteriorgrams as a language-independent feature to synthesize cross-lingual speech, removing the language discrepancy at the feature level.…”
Section: Introduction
confidence: 99%
“…Sun et al. [8] proposed using a speaker-independent speech recognition model to extract phonetic posteriorgrams (PPGs) from source speech (uttered by the source speaker), and then using the target speaker's data to map the PPGs to the target speaker's speech. Cao et al. [9] used PPGs in cross-lingual speech synthesis tasks, improving the synthesized speech in both intelligibility and audio fidelity. In [10], the parameter space of the original speaker's voice was decomposed into two subspaces, one modeling the spoken content and the other the speaker's voice.…”
Section: Introduction
confidence: 99%
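The statement above describes the core idea behind PPG-based synthesis: a speaker-independent ASR acoustic model scores each speech frame against a phone inventory, and the per-frame posterior distributions (rather than speaker-dependent acoustics) serve as the language- and speaker-independent intermediate representation. A minimal NumPy sketch of that extraction step, assuming we already have per-frame acoustic-model logits (the ASR model itself, frame count, and phone-set size are placeholders):

```python
import numpy as np

def extract_ppg(logits: np.ndarray) -> np.ndarray:
    """Turn per-frame ASR logits of shape (T, P) into a phonetic
    posteriorgram: each of the T rows becomes a probability
    distribution over the P phone classes for that frame."""
    # Numerically stable softmax along the phone axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

# Toy stand-in for an acoustic model's output:
# 5 frames scored against a 40-phone inventory.
rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 40))
ppg = extract_ppg(logits)

print(ppg.shape)        # one distribution per frame: (5, 40)
print(ppg.sum(axis=1))  # each row sums to 1 (up to float error)
```

In a full pipeline along the lines sketched in [8] and [9], this (T, P) matrix, not the source speaker's spectrum, would be the input that a target-speaker synthesis model maps back to speech, which is what makes the representation usable across speakers and languages.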
“…[10] proposes a phonetic transformation network that learns the target symbol distribution with the help of Automatic Speech Recognition (ASR) systems. In [11, 12], language-independent Phonetic PosteriorGram (PPG) features from ASR models are used as input to cross-lingual TTS models. [ further proposes a mixed-lingual grapheme-to-phoneme (G2P) frontend to improve the pronunciation of mixed-lingual sentences in cross-lingual TTS systems.…”
Section: Introduction
confidence: 99%