Chinese Spelling Check (CSC) is a challenging task due to the complex characteristics of Chinese characters. Statistics reveal that most Chinese spelling errors are phonological or visual. However, previous methods rarely utilize the phonological and morphological knowledge of Chinese characters, or they rely heavily on external resources to model character similarities. To address these issues, we propose a novel end-to-end trainable model called PHMOSpell, which improves CSC performance with multi-modal information. Specifically, we derive pinyin¹ and glyph² representations for Chinese characters from the audio and visual modalities respectively, and integrate them into a pre-trained language model through a well-designed adaptive gating mechanism. To verify its effectiveness, we conduct comprehensive experiments and ablation tests. Experimental results on three shared benchmarks demonstrate that our model consistently outperforms previous state-of-the-art models.
¹ Pinyin is the official phonetic system of Mandarin Chinese, which usually consists of three parts: initials, finals, and tones.
² A radical is a basic building block of Chinese characters.
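The adaptive gating idea described above can be illustrated with a minimal sketch: a learned sigmoid gate decides, per dimension, how much of a modality embedding (e.g. a pinyin or glyph vector) to mix into the language model's hidden state. This is a generic gated-fusion sketch, not the paper's exact formulation; the names `adaptive_gate`, `W`, and `b` are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_gate(h, m, W, b):
    """Fuse a modality vector m (e.g. a pinyin or glyph embedding) into a
    hidden state h via a learned per-dimension sigmoid gate."""
    g = sigmoid(np.concatenate([h, m]) @ W + b)  # gate values in (0, 1)
    # Convex combination per dimension: g close to 1 keeps h, close to 0 takes m.
    return g * h + (1.0 - g) * m

rng = np.random.default_rng(0)
d = 4
h = rng.standard_normal(d)           # hidden state from the language model
m = rng.standard_normal(d)           # multi-modal (pinyin/glyph) embedding
W = rng.standard_normal((2 * d, d)) * 0.1
b = np.zeros(d)
fused = adaptive_gate(h, m, W, b)
```

Because the output is a per-dimension convex combination, each fused value stays between the corresponding entries of `h` and `m`, which makes the fusion easy to interpret and to train.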
Neural network-based models for text-to-speech (TTS) synthesis have made significant progress in recent years. In this paper, we present a cross-lingual, multi-speaker neural end-to-end TTS framework that can model speaker characteristics and synthesize speech in different languages. We implement the model by introducing a separately trained neural speaker-embedding network, which can represent the latent structure of different speakers and language pronunciations. We train the speech synthesis network bilingually and demonstrate that it can synthesize English speech for a Chinese speaker and vice versa. We explore different methods for fitting a new speaker using only a few speech samples. The experimental results show that, with only several minutes of audio from a new speaker, the proposed model can synthesize speech bilingually and achieve decent naturalness and similarity in both languages.
Text normalization (TN) is an essential component of conversational systems such as text-to-speech synthesis (TTS) and automatic speech recognition (ASR). It is the process of transforming non-standard words (NSWs) into a representation of how the words are to be spoken. Existing approaches to TN are mainly rule-based or hybrid systems, which require abundant handcrafted rules. In this paper, we treat TN as a neural machine translation problem and present a purely data-driven TN system built on the Transformer framework. A Partial Parameter Generator (PPG) and a Pointer-Generator Network (PGN) are combined in our model to improve normalization accuracy and serve as auxiliary modules that reduce the number of simple errors. The experiments demonstrate that our proposed model achieves remarkable performance on various semiotic classes.
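The pointer-generator mechanism mentioned above lets the decoder either generate a token from the vocabulary or copy one from the input, which helps avoid simple errors on tokens that should pass through unchanged. A minimal sketch of the standard mixing step follows; the function name `pointer_generator_mix` and the toy distributions are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def pointer_generator_mix(p_vocab, attn, src_ids, p_gen, vocab_size):
    """Combine the decoder's vocabulary distribution with a copy
    distribution obtained by scattering attention weights onto the
    vocabulary ids of the source tokens."""
    p_copy = np.zeros(vocab_size)
    # Unbuffered scatter-add: repeated source ids accumulate attention mass.
    np.add.at(p_copy, src_ids, attn)
    # p_gen in [0, 1] interpolates between generating and copying.
    return p_gen * p_vocab + (1.0 - p_gen) * p_copy

vocab_size = 6
p_vocab = np.array([0.1, 0.2, 0.3, 0.2, 0.1, 0.1])  # decoder output distribution
attn = np.array([0.5, 0.3, 0.2])                    # attention over 3 source tokens
src_ids = np.array([2, 4, 2])                       # token 2 appears twice in the source
p_final = pointer_generator_mix(p_vocab, attn, src_ids, p_gen=0.7,
                                vocab_size=vocab_size)
```

Since both input distributions sum to one, the mixture is again a valid distribution, and tokens present in the source (here id 2) receive extra probability mass from the copy path.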