2021
DOI: 10.48550/arxiv.2110.09698
Preprint

Neural Lexicon Reader: Reduce Pronunciation Errors in End-to-end TTS by Leveraging External Textual Knowledge

Abstract: End-to-end TTS suffers from high data requirements: it is difficult for costly speech corpora to cover all the necessary knowledge and for neural models to learn that knowledge, so additional knowledge must be injected manually. For example, to capture pronunciation knowledge in languages without a regular orthography, a complicated grapheme-to-phoneme pipeline must be built on top of a large, structured pronunciation lexicon, adding extra, sometimes high, costs to extending neural TTS to such languages.…
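For context on the pipeline the abstract refers to, below is a minimal sketch of a lexicon-based grapheme-to-phoneme lookup. It is a toy illustration under our own assumptions (the lexicon entries and the g2p helper are hypothetical, not from the paper); real pipelines add tokenization, heteronym disambiguation, and letter-to-sound rules.

```python
# Toy lexicon-based G2P lookup (illustrative only; not the paper's method).
LEXICON = {
    "read": ["R IY1 D", "R EH1 D"],  # heteronym: context decides pronunciation
    "lexicon": ["L EH1 K S IH0 K AA2 N"],
}

def g2p(word: str) -> str:
    """Return the first lexicon pronunciation, or spell out OOV words."""
    prons = LEXICON.get(word.lower())
    if prons:
        return prons[0]  # real systems disambiguate heteronyms with context
    # Crude fallback: letter-by-letter spelling for out-of-vocabulary words.
    return " ".join(word.upper())

if __name__ == "__main__":
    for w in ["read", "TTS"]:
        print(w, "->", g2p(w))
```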

Cited by 1 publication (3 citation statements) · References: 16 publications
“…We also compare the pronunciation accuracy of our Dict-TTS with various types of systems, including: 1) a character-based system; 2) a BERT-embedding-based system [15], where the BERT-derived embeddings are concatenated with the character embeddings; 3) NLR [16], a TTS system that directly injects BERT-derived knowledge into the linguistic encoder without phoneme labels; 4) Phoneme (G2PM [34]), PortaSpeech with phoneme labels derived from G2PM (a powerful neural G2P system); and 5) Phoneme (pypinyin), PortaSpeech with phoneme labels derived from pypinyin (one of the most popular Chinese G2P systems). As shown in Table 2, Dict-TTS greatly surpasses systems such as NLR [16] that implicitly model semantic representations for character-to-pronunciation mapping, and shows performance comparable to that of phoneme-based systems. Since Dict-TTS does not require any phoneme labels for training, we can pre-train it on a large-scale ASR dataset [57] with little effort to improve its generalization capacity.…”
Section: Results of Pronunciation Accuracy (mentioning; confidence: 99%)
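As background on the last baseline in that statement: pypinyin is an open-source Python library for converting Chinese characters to pinyin. The following is a minimal sketch of how pinyin labels might be derived with it, assuming pypinyin is installed (pip install pypinyin); the example sentence is our own, not taken from either paper.

```python
# Minimal sketch: deriving pinyin labels with pypinyin, in the spirit of
# the Phoneme (pypinyin) baseline described above. Assumes: pip install pypinyin
from pypinyin import pinyin, Style

text = "重庆的天气很好"  # "The weather in Chongqing is nice" (our own example)

# TONE3 style appends the tone number to each syllable, e.g. "chong2".
labels = [syl[0] for syl in pinyin(text, style=Style.TONE3)]
print(labels)

# Heteronymous characters such as 重 (chong2 / zhong4) are where purely
# dictionary-based G2P can err; heteronym=True exposes the candidate readings.
print(pinyin("重", heteronym=True))
```

Heteronyms like 重 are precisely the failure mode that motivates the semantics-aware systems compared in the statement above.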
“…However, these methods still require annotated data to train and cannot be incorporated into TTS training in an end-to-end manner. Although NLR [16] directly injects BERT-derived knowledge into the TTS system without phoneme labels and successfully reduces pronunciation errors, the method confounds the acoustic and semantic spaces, which significantly affects pronunciation accuracy.…”
Section: Grapheme-to-Phoneme (mentioning; confidence: 99%)
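To make the "BERT-derived knowledge" idea in these statements concrete, here is a generic sketch of concatenating frozen BERT features with learned character embeddings ahead of a TTS linguistic encoder, using PyTorch and the Hugging Face transformers library. The module, its dimensions, and the fusion choice are our own illustrative assumptions; this is neither the NLR nor the Dict-TTS architecture.

```python
# Illustrative sketch (not NLR or Dict-TTS): fusing BERT-derived features
# with character embeddings for a TTS linguistic encoder.
# Assumes: pip install torch transformers
import torch
import torch.nn as nn
from transformers import BertModel

class BertFusedCharEncoder(nn.Module):
    """Concatenate frozen BERT features with learned character embeddings."""

    def __init__(self, vocab_size: int, char_dim: int = 256, out_dim: int = 256):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-chinese")
        self.bert.requires_grad_(False)  # BERT as a fixed feature extractor
        self.char_emb = nn.Embedding(vocab_size, char_dim)
        bert_dim = self.bert.config.hidden_size  # 768 for bert-base
        self.proj = nn.Linear(char_dim + bert_dim, out_dim)

    def forward(self, char_ids, bert_ids, bert_mask):
        # For simplicity this assumes one BERT token per character; real
        # systems need an alignment between the two tokenizations.
        bert_feats = self.bert(bert_ids, attention_mask=bert_mask).last_hidden_state
        fused = torch.cat([self.char_emb(char_ids), bert_feats], dim=-1)
        return self.proj(fused)  # fed to the rest of the TTS encoder
```

Concatenation is only one fusion choice; the second statement's critique is that mixing such semantic features directly into the acoustic pathway can entangle the two spaces.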