Interspeech 2020
DOI: 10.21437/interspeech.2020-1821

Phonological Features for 0-Shot Multilingual Speech Synthesis

Abstract: Code-switching, the intra-utterance use of multiple languages, is prevalent across the world. Within text-to-speech (TTS), multilingual models have been found to enable code-switching [1][2][3]. By modifying the linguistic input to sequence-to-sequence TTS, we show that code-switching is possible for languages unseen during training, even within monolingual models. We use a small set of phonological features derived from the International Phonetic Alphabet (IPA), such as vowel height and frontness, consonant place…
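The abstract describes replacing phoneme identities with small phonological feature vectors as the linguistic input to a sequence-to-sequence TTS encoder. A minimal sketch of that idea follows; the feature inventory, the phoneme specifications, and the one-hot encoding scheme are illustrative assumptions, not the paper's exact feature set.

```python
# Minimal sketch (not the authors' code): describe each IPA phoneme with
# multi-valued phonological features and turn that description into a fixed-
# length vector to feed the TTS encoder. Inventory and values are assumptions.

FEATURES = ["is_vowel", "height", "frontness", "rounding", "place", "manner", "voicing"]

# Hypothetical multi-valued specifications for a few phonemes.
PHONEMES = {
    "i": {"is_vowel": 1, "height": "high", "frontness": "front", "rounding": "unrounded"},
    "o": {"is_vowel": 1, "height": "mid",  "frontness": "back",  "rounding": "rounded"},
    "k": {"is_vowel": 0, "place": "velar",    "manner": "stop",      "voicing": "voiceless"},
    "x": {"is_vowel": 0, "place": "velar",    "manner": "fricative", "voicing": "voiceless"},
    "s": {"is_vowel": 0, "place": "alveolar", "manner": "fricative", "voicing": "voiceless"},
}

# Fixed value inventories so every phoneme maps to the same-length vector.
VALUES = {
    "is_vowel": [0, 1],
    "height": ["high", "mid", "low"],
    "frontness": ["front", "central", "back"],
    "rounding": ["unrounded", "rounded"],
    "place": ["bilabial", "alveolar", "palatal", "velar"],
    "manner": ["stop", "fricative", "nasal", "approximant"],
    "voicing": ["voiceless", "voiced"],
}

def encode(phoneme: str) -> list[int]:
    """One-hot encode each multi-valued feature and concatenate the results."""
    spec = PHONEMES[phoneme]
    vec = []
    for feat in FEATURES:
        for value in VALUES[feat]:
            vec.append(1 if spec.get(feat) == value else 0)
    return vec

# A phoneme from an unseen language gets a vector whose individual features the
# model has already encountered during training, which is what makes 0-shot
# synthesis plausible.
print(encode("x"))
```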

Cited by 15 publications (17 citation statements). References 19 publications.
“…As such, at the beginning of our fine-tuning regime the encoder of our German model is initialised with a representation of /x/ which already contains much information learned from the English /k/, supplemented by [+continuant] English phonemes such as /s/. Although we do not test it formally here, we find these initial representations to produce somewhat intelligible German speech even before any target-language data has been seen by the model, as in [10], albeit retaining our English source speaker's vocal quality and accent.…”
Section: Phonological Features (mentioning)
confidence: 93%
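The statement above attributes the zero-shot behaviour to feature sharing: the unseen German /x/ inherits place and voicing information from the English /k/ and continuancy from phonemes such as /s/. The toy sketch below makes that overlap concrete; the three-feature specification is a simplification chosen for illustration, not the representation used in either paper.

```python
# Illustrative only: why a feature-based input gives the unseen German /x/ a
# useful starting point. /x/ shares its velar place and voicelessness with
# English /k/, and [+continuant] with English /s/, so its feature vector lies
# between phonemes the encoder has already learned. The feature values follow
# standard phonology, but the tiny inventory is an assumption for this sketch.

SPECS = {
    "x": {"place": "velar",    "continuant": True,  "voiced": False},  # unseen target phoneme
    "k": {"place": "velar",    "continuant": False, "voiced": False},  # seen in English
    "s": {"place": "alveolar", "continuant": True,  "voiced": False},  # seen in English
}

def shared_features(a: str, b: str) -> list[str]:
    """Return the feature names on which two phonemes agree."""
    return [f for f in SPECS[a] if SPECS[a][f] == SPECS[b][f]]

print("x ~ k:", shared_features("x", "k"))  # ['place', 'voiced']
print("x ~ s:", shared_features("x", "s"))  # ['continuant', 'voiced']
```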
“…Our binary feature representation largely overlaps with that used in PanPhon [12], and differs from the multi-valued features used in [10], which map more directly to IPA categories such as vowel frontness or consonant place. While our feature set gives a more compact representation, with 24 features vs. 60 in [10] (after conversion to binary vectors), it is perhaps less interpretable in familiar linguistic terms, for example with the palatal place of articulation feature in a multi-valued representation instead being composed from [+high, −low, −back] feature specifications in our system.…”
Section: Phonological Features (mentioning)
confidence: 99%
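To illustrate the contrast drawn above between multi-valued and binary feature representations, here is a rough sketch in which the palatal place of articulation is either a single categorical value or the combination [+high, −low, −back]. The feature names and inventory are assumptions loosely modelled on SPE/PanPhon-style features, not the exact sets of [10] or [12].

```python
# Rough sketch of the two representation styles described in the citation above.

# Multi-valued style: place is one categorical feature with an IPA-like value.
multi_valued_palatal = {"place": "palatal"}

# Binary style: the same place information is spread across several +/- features.
binary_palatal = {"high": +1, "low": -1, "back": -1}

# Binary vectors stay compact (roughly 24 dimensions vs. around 60 once the
# multi-valued categories are converted to one-hot vectors) but are less
# directly readable as familiar IPA categories.
def to_vector(spec: dict, inventory: list[str]) -> list[int]:
    """Map a {feature: +1/-1} spec onto a fixed feature inventory (0 = unspecified)."""
    return [spec.get(f, 0) for f in inventory]

INVENTORY = ["syllabic", "consonantal", "continuant", "voice", "high", "low", "back"]
print(to_vector(binary_palatal, INVENTORY))  # [0, 0, 0, 0, 1, -1, -1]
```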