2022
DOI: 10.48550/arxiv.2201.09427
Preprint
Polyphone disambiguation and accent prediction using pre-trained language models in Japanese TTS front-end

Abstract: Although end-to-end text-to-speech (TTS) models can generate natural speech, challenges still remain when it comes to estimating sentence-level phonetic and prosodic information from raw text in Japanese TTS systems. In this paper, we propose a method for polyphone disambiguation (PD) and accent prediction (AP). The proposed method incorporates explicit features extracted from morphological analysis and implicit features extracted from pre-trained language models (PLMs). We use BERT and Flair embeddings as implicit features…
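The abstract describes fusing explicit features from morphological analysis with implicit PLM embeddings before classification. Below is a minimal PyTorch sketch of that fusion; the feature dimensions, the two-head layout, and the label counts are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class PDAPClassifier(nn.Module):
    """Toy fusion model: concatenate explicit (morphological) features with
    implicit (PLM) token embeddings, then classify each token.
    All sizes here are assumptions for illustration only."""

    def __init__(self, plm_dim=768, morph_dim=32, hidden=256,
                 n_pron_labels=10, n_accent_labels=3):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(plm_dim + morph_dim, hidden),
            nn.ReLU(),
        )
        # Two heads: polyphone disambiguation (PD) and accent prediction (AP).
        self.pd_head = nn.Linear(hidden, n_pron_labels)
        self.ap_head = nn.Linear(hidden, n_accent_labels)

    def forward(self, plm_emb, morph_feat):
        # plm_emb:    (batch, seq, plm_dim)   from a PLM such as BERT or Flair
        # morph_feat: (batch, seq, morph_dim) from a morphological analyzer
        h = self.fuse(torch.cat([plm_emb, morph_feat], dim=-1))
        return self.pd_head(h), self.ap_head(h)

# Dummy tensors standing in for real BERT embeddings and morphological features.
model = PDAPClassifier()
pd_logits, ap_logits = model(torch.randn(1, 12, 768), torch.randn(1, 12, 32))
print(pd_logits.shape, ap_logits.shape)  # (1, 12, 10) and (1, 12, 3)
```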

Cited by 1 publication (2 citation statements)
References 13 publications
“…The pronunciation of a polyphone is defined by the semantic context information of neighbouring characters [46]. In order to comprehend the semantic meaning in the given sentence for polyphone disambiguation, previous methods [11,53,45,18,8] have adopted the pre-trained language model [12] to extract semantic features from raw character sequences and predict the pronunciation of polyphones with neural classifiers according to the semantic features. Among them, PnG BERT and Mixed-Phoneme BERT [22,58] take both phoneme and grapheme as input to train an augmented BERT and use the pre-trained augmented BERT as the TTS encoder.…”
Section: Grapheme-to-phoneme (mentioning)
confidence: 99%
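The statement above describes the common recipe in the cited works: a PLM extracts contextual features from the raw character sequence, and a neural classifier picks the polyphone's pronunciation from those features. A minimal sketch of that recipe follows, assuming the Hugging Face bert-base-chinese checkpoint, a toy two-way label set for the polyphone 行, and an untrained linear head (so the prediction is meaningless until the head is fine-tuned); the actual models and label inventories in the cited works differ.

```python
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

# Hypothetical candidate pronunciations for the Chinese polyphone 行.
CANDIDATES = ["xing2", "hang2"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
bert = AutoModel.from_pretrained("bert-base-chinese")
classifier = nn.Linear(bert.config.hidden_size, len(CANDIDATES))  # untrained demo head

sentence = "银行在那条街上"  # "The bank is on that street"
target_char = "行"

inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state  # (1, seq_len, hidden_size)

# Locate the polyphone's token position; bert-base-chinese tokenizes
# per character, so this reduces to a simple index lookup.
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
pos = tokens.index(target_char)

logits = classifier(hidden[0, pos])      # classify from the contextual feature
pred = CANDIDATES[int(logits.argmax())]  # random until the head is trained
print(pred)
```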
“…Capturing the pronunciations from raw texts is challenging for end-to-end text-to-speech (TTS) systems [2,28,38,41,51,24,14,37], since texts are full of words that are not covered by general pronunciation rules [4,21,46]. Therefore, polyphone disambiguation (one of the biggest challenges in converting texts into phonemes [33,56,42]) plays an important role in the construction of high-quality neural TTS systems [18,34]. However, since the exact pronunciation of a polyphone must be …”
[Figure: Illustration of a dictionary entry containing information on a character's or word's definitions, usages, and pronunciations.]
Section: Introduction (mentioning)
confidence: 99%