Inequality Maximum Entropy Classifier with Character Features for Polyphone Disambiguation in Mandarin TTS Systems

Mao, Xinnian; Yuan, Dong; Han, Junyu; Huang, Dezhi; Wang, Haila

doi:10.1109/icassp.2007.367010

Cited by 17 publications

(10 citation statements)

References 5 publications

“…However, this requires a substantial amount of linguistic knowledge. The data driven approach, by contrast, adopts statistical methods such as Decision Tree [3] or Maximum Entropy Model [2,10]. Recently [1,4] use bidirectional Long Short-Term Memory (LSTM) [11] to extract diverse features on the character, word, and sentence level.…”

Section: Related Work (mentioning)

confidence: 99%

g2pM: A Neural Grapheme-to-Phoneme Conversion Package for Mandarin Chinese Based on a New Open Benchmark Dataset

Park, Lee

2020

Interspeech 2020

Conversion of Chinese graphemes to phonemes (G2P) is an essential component in Mandarin Chinese Text-To-Speech (TTS) systems. One of the biggest challenges in Chinese G2P conversion is how to disambiguate the pronunciation of polyphones, that is, characters having multiple pronunciations. Although many academic efforts have been made to address it, there has been no open dataset that can serve as a standard benchmark for fair comparison to date. In addition, most of the reported systems are hard to employ for researchers or practitioners who want to convert Chinese text into pinyin at their convenience. Motivated by these, in this work, we introduce a new benchmark dataset that consists of 99,000+ sentences for Chinese polyphone disambiguation. We train a simple neural network model on it, and find that it outperforms other preexisting G2P systems. Finally, we package our project and share it on PyPI.

“…Additionally, tone sandhi and Erhua are also the key issues in the intelligibility of Mandarin TTS. The most popular method for PD is to apply the ME model for each polyphone [10]. Using a unified model for all polyphones were also investigated [11], recently.…”

Section: Grapheme-to-phoneme (mentioning)

confidence: 99%

A Unified Sequence-to-Sequence Front-End Model for Mandarin Text-to-Speech Synthesis

Pan, Yin, Zhang, et al.

2020

ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

In Mandarin text-to-speech (TTS) system, the front-end text processing module significantly influences the intelligibility and naturalness of synthesized speech. Building a typical pipeline-based front-end which consists of multiple individual components requires extensive efforts. In this paper, we proposed a unified sequence-to-sequence front-end model for Mandarin TTS that converts raw texts to linguistic features directly. Compared to the pipeline-based front-end, our unified front-end can achieve comparable performance in polyphone disambiguation and prosody word prediction, and improve intonation phrase prediction by 0.0738 in F1 score. We also implemented the unified front-end with Tacotron and WaveRNN to build a Mandarin TTS system. The synthesized speech by that got a comparable MOS (4.38) with the pipeline-based front-end (4.37) and close to human recordings (4.49).

“…Therefore a classifier is need to be learned for each character to predict its correct pronunciation in given context. For polyphone disambiguation, machine learning methods like ME [31], CART [29] or MLP are also common used. and the traditional linguistic features (like character, POS, word-terminal syllables etc.)…”

Section: Polyphone Disambiguation (mentioning)

confidence: 99%

Pre-Trained Text Representations for Improving Front-End Text Processing in Mandarin Text-to-Speech Synthesis

Yang, Zhong, Liu

2019

Interspeech 2019

In this paper, we propose a novel method to improve the performance and robustness of the front-end text processing modules of Mandarin text-to-speech (TTS) synthesis. We use pretrained text encoding models, such as the encoder of a transformer based NMT model and BERT, to extract the latent semantic representations of words or characters and use them as input features for tasks in the front-end of TTS systems. Our experiments on the tasks of Mandarin polyphone disambiguation and prosodic structure prediction show that the proposed method can significantly improve the performances. Specifically, we get an absolute improvement of 0.013 and 0.027 in F1 score for prosodic word prediction and prosodic phrase prediction respectively, and an absolute improvement of 2.44% in polyphone disambiguation compared to previous methods.
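Illustrative note: the citation statements above describe training one maximum entropy (ME) classifier per polyphonic character, using features drawn from the surrounding context. The sketch below shows that general idea only. It is a plain ME model (multinomial logistic regression with an L2 penalty), not the inequality-constrained variant named in the cited paper's title, and it is not taken from any of the papers listed here. The function names, the feature template (characters within two positions of the polyphone), the hyperparameters, and the six toy sentences for the character 行 are all assumptions made for illustration.

```python
import math

def char_features(text, idx, window=2):
    """Context-character features: the characters at offsets -window..+window
    around position idx (the polyphone itself is skipped)."""
    feats = []
    for off in range(-window, window + 1):
        if off == 0:
            continue
        j = idx + off
        ch = text[j] if 0 <= j < len(text) else "<pad>"
        feats.append(f"c[{off}]={ch}")
    return feats

def predict_proba(w, classes, feats):
    """Softmax over summed feature weights; unseen features contribute 0."""
    scores = {c: sum(w.get((f, c), 0.0) for f in feats) for c in classes}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def predict(w, classes, feats):
    probs = predict_proba(w, classes, feats)
    return max(probs, key=probs.get)

def train_maxent(samples, labels, epochs=200, lr=0.5, l2=0.01):
    """Batch gradient ascent on the L2-penalized conditional log-likelihood."""
    classes = sorted(set(labels))
    w = {}
    n = len(samples)
    for _ in range(epochs):
        grad = {}
        for feats, gold in zip(samples, labels):
            probs = predict_proba(w, classes, feats)
            for c in classes:
                # Gradient term: observed indicator minus model expectation.
                err = (1.0 if c == gold else 0.0) - probs[c]
                for f in feats:
                    grad[(f, c)] = grad.get((f, c), 0.0) + err
        for key, g in grad.items():
            old = w.get(key, 0.0)
            w[key] = old + lr * (g / n - l2 * old)
    return w, classes

# Hypothetical toy data for one polyphone, 行: (sentence, index of 行, pinyin).
TOY_DATA = [
    ("去银行取钱", 2, "hang2"),
    ("他在行走", 2, "xing2"),
    ("这个行业很大", 2, "hang2"),
    ("我们步行回家", 3, "xing2"),
    ("出门旅行", 3, "xing2"),
    ("今天行情不错", 2, "hang2"),
]

samples = [char_features(text, idx) for text, idx, _ in TOY_DATA]
labels = [label for _, _, label in TOY_DATA]
model, classes = train_maxent(samples, labels)

# Classify 行 in a sentence that was not in the toy training set.
guess = predict(model, classes, char_features("银行在哪里", 1))
print(guess)
```

In a per-polyphone setup of the kind the statements describe, one such model would be trained for each ambiguous character, with that character's candidate pronunciations as the label set. A real system would need far more data and richer features than this toy set.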
