10th ISCA Workshop on Speech Synthesis (SSW 10) 2019
DOI: 10.21437/ssw.2019-40

A Comparison of Letters and Phones as Input to Sequence-to-Sequence Models for Speech Synthesis

Abstract: Neural sequence-to-sequence (S2S) models for text-to-speech synthesis (TTS) may take letter or phone input sequences. Since for many languages phones have a more direct relationship to the acoustic signal, they lead to improved quality. But generating phone transcriptions from text requires an expensive dictionary and an error-prone grapheme-to-phoneme (G2P) model, and the relative improvement over using letters has yet to be quantified. In approaching this question, we presume that letter-input S2S models must…
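The contrast the abstract draws can be made concrete with a minimal sketch (not from the paper): letter input needs no resources beyond the text itself, while phone input requires a pronunciation lexicon plus a G2P model as back-off. The toy LEXICON below and the fall-back-to-letters policy for out-of-vocabulary words are illustrative assumptions only.

```python
# Toy pronunciation lexicon (ARPAbet-style phones); a real system would
# use a full dictionary such as CMUdict plus a trained G2P model.
LEXICON = {
    "speech": ["S", "P", "IY1", "CH"],
    "synthesis": ["S", "IH1", "N", "TH", "AH0", "S", "AH0", "S"],
}

def letter_input(text):
    """Letter input: no dictionary or G2P model needed."""
    return [ch for ch in text.lower() if ch.isalpha() or ch == " "]

def phone_input(text):
    """Phone input: look each word up in the lexicon; here we naively
    fall back to letters for out-of-vocabulary words, where a real
    front end would invoke a (possibly error-prone) G2P model."""
    phones = []
    for word in text.lower().split():
        phones.extend(LEXICON.get(word, list(word)))
    return phones

print(letter_input("speech synthesis"))
print(phone_input("speech synthesis"))
```

The extra machinery on the phone side (lexicon coverage, G2P error handling) is exactly the cost the paper weighs against the quality gain.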

Cited by 16 publications (13 citation statements)
References 14 publications
“…System P system outperformed system G, repeating our previous findings from [10] with a different Tacotron implementation and vocoder. This is because phones map pronunciations directly unlike letters, leading to fewer ambiguities at training and test time.…”
Section: MUSHRA Results for TTS Comparison (supporting)
confidence: 84%
“…In previous work [10], we found phone input significantly improved the quality of Ophelia, a different S2S architecture based on DCTTS [26]. Here, we also measure the effect of morphemes when using phone-based input.…”
Section: MUSHRA Design (mentioning)
confidence: 80%
“…Phonemes are often used as atomic input symbols to text-to-speech (TTS) systems as an explicit representation of the pronunciation of input text [1]. This is useful even for large neural sequence-to-sequence models which have the capacity to learn implicit pronunciation models directly from text inputs but which may make mistakes compared to grapheme-to-phoneme (g2p) conversion models trained on high-quality lexicons [2,3]. Such large TTS models are typically trained using tens of hours of audio data with associated text transcriptions, which, alongside the specialist linguistic knowledge required to convert raw text into phoneme strings, are expensive resources to attain and limit the application of these models to a small proportion of the world's 7,000 languages.…”
Section: Introduction (mentioning)
confidence: 99%
“…Recently proposed neural architectures [1,2] have shown that an efficient end-to-end acoustic model is possible by directly consuming text characters. The inputs to state-of-the-art TTS systems consist of either text characters (graphemes) or phonemes, with the superiority of phoneme-based systems recently quantified [3]. In multilingual TTS, these inputs may originate from various speakers and languages introducing variable factors in the model's logic.…”
Section: Introduction (mentioning)
confidence: 99%