An efficient input method engine (IME) is very important for Chinese language processing, and pinyin-to-Chinese (PTC) conversion is its core component. Although typos are inevitable when users type pinyin, existing IMEs have paid little attention to this inconvenience. In this paper, motivated by a key equivalence between two decoding algorithms, we propose a joint graph model that globally optimizes PTC conversion and typo correction for IMEs. The evaluation results show that the proposed method outperforms existing academic and commercial IMEs.
Spell checking for Chinese is more challenging than for other languages. A hybrid model for Chinese spell checking is presented in this article. The hybrid model consists of three components: one graph-based model for generic errors and two independently trained models for specific errors. In the graph model, a directed acyclic graph is generated for each sentence, and a single-source shortest-path algorithm is run on the graph to detect and correct generic spelling errors at the same time. Prior to that, two types of errors over function words (characters) are handled by conditional random fields: the confusion of “在” (at) (pinyin: zai) with “再” (again, more, then) (pinyin: zai), and the confusion among “的” (of) (pinyin: de), “地” (-ly, adverb-forming particle) (pinyin: de), and “得” (so that, have to) (pinyin: de). Finally, a rule-based model is used to resolve pronoun usage confusion between “她” (she) (pinyin: ta) and “他” (he) (pinyin: ta), along with some other common collocation errors. The proposed model is evaluated on the standard datasets released by the SIGHAN Bake-off shared tasks, giving state-of-the-art results.
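The graph step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the confusion sets, bigram costs, and substitution penalty below are made-up toy values, and the names `CONFUSION`, `BIGRAM_COST`, and `correct` are hypothetical. Each character position becomes a layer of candidate nodes, edges connect adjacent layers, and because the graph is acyclic the single-source shortest path can be found by relaxing edges layer by layer.

```python
# Toy sketch: spelling correction as a shortest path over a layered DAG.
# All tables below are illustrative assumptions, not data from the paper.

# Hypothetical confusion sets: characters that are easily mistaken for each other.
CONFUSION = {"在": ["再"], "再": ["在"]}

# Hypothetical bigram costs (lower means more plausible).
BIGRAM_COST = {
    ("<s>", "我"): 1.0,
    ("我", "在"): 1.0,
    ("在", "家"): 1.0,
    ("家", "</s>"): 1.0,
    ("我", "再"): 3.0,
    ("再", "家"): 4.0,
}
DEFAULT_COST = 5.0          # cost assumed for any bigram not in the table
SUBSTITUTION_PENALTY = 0.5  # extra cost for replacing the original character


def correct(sentence):
    """Return the lowest-cost character sequence through the candidate graph."""
    # One layer of candidate nodes per position, plus start and end nodes.
    layers = [["<s>"]]
    layers += [[ch] + CONFUSION.get(ch, []) for ch in sentence]
    layers += [["</s>"]]

    # best[i][node] = (cost of the cheapest path from <s>, predecessor node)
    best = [{"<s>": (0.0, None)}]
    for i in range(1, len(layers)):
        original = sentence[i - 1] if i <= len(sentence) else None
        layer_best = {}
        for cand in layers[i]:
            changed = original is not None and cand != original
            penalty = SUBSTITUTION_PENALTY if changed else 0.0
            # Relax every edge coming in from the previous layer.
            layer_best[cand] = min(
                (best[i - 1][prev][0]
                 + BIGRAM_COST.get((prev, cand), DEFAULT_COST)
                 + penalty, prev)
                for prev in layers[i - 1]
            )
        best.append(layer_best)

    # Follow back-pointers from the end node to recover the path.
    path = []
    node = "</s>"
    for i in range(len(layers) - 1, 1, -1):
        node = best[i][node][1]
        path.append(node)
    return "".join(reversed(path))


if __name__ == "__main__":
    print(correct("我再家"))
```

Detection and correction happen in the same pass here: any position where the returned path differs from the input is both flagged and fixed, which mirrors the "at the same time" claim in the abstract.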
In this paper, we propose an improved graph model for Chinese spell checking. The model is based on a graph model for generic errors and two independently trained models for specific errors. First, a graph model represents a Chinese sentence, and a modified single-source shortest-path algorithm is run on the graph to detect and correct generic spelling errors. Then, we use conditional random fields to handle two specific kinds of common errors: the confusion of "在" (at) (pinyin: zai) with "再" (again, more, then) (pinyin: zai), and the confusion among "的" (of) (pinyin: de), "地" (-ly, adverb-forming particle) (pinyin: de), and "得" (so that, have to) (pinyin: de). Finally, a rule-based system is used to resolve pronoun usage confusion between "她" (she) (pinyin: ta) and "他" (he) (pinyin: ta), along with some other fixed collocation errors. The proposed model is evaluated on the standard data set released by the SIGHAN Bake-off 2014 shared task and gives competitive results. * This work was partially supported by the National Natural Science Foundation of China (No. 60903119, No. 61170114, and No. 61272248) (CSC fund 201304490199 and 201304490171), and the art and science interdiscipline funds of Shanghai Jiao Tong University (A study on mobilization mechanism and alerting threshold setting for online community, and media image and psychology evaluation: a computational intelligence approach).
This paper describes the system of the Shanghai Jiao Tong University team in the CoNLL-2014 shared task. Error correction operations are encoded as a group of predefined labels, and the task is therefore formulated as a multi-label classification task. For training, labels are obtained through a strict rule-based approach. For decoding, errors are detected and corrected according to the classification results. A single maximum entropy model, combined with an improved feature selection algorithm, is used for classification. Our system achieved a precision of 29.83, a recall of 5.16, and an F0.5 of 15.24 in the official evaluation.
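To make the label-encoding idea concrete, here is a minimal sketch of how edits between a learner sentence and its correction might be turned into per-token label sets and applied back at decoding time. The abstract does not list the actual label inventory or extraction rules, so the labels `KEEP`, `DELETE`, `REPLACE:…`, and `INSERT_BEFORE:…` and the functions `encode` and `decode` are assumptions made for illustration. The alignment uses `difflib.SequenceMatcher` from the Python standard library, and the classifier itself is left out.

```python
from difflib import SequenceMatcher


def encode(source, target):
    """Map each source token to a set of edit labels (hypothetical scheme)."""
    labels = [set() for _ in source]
    matcher = SequenceMatcher(a=source, b=target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            for i in range(i1, i2):
                labels[i].add("KEEP")
        elif tag == "delete":
            for i in range(i1, i2):
                labels[i].add("DELETE")
        elif tag == "replace" and (i2 - i1) == (j2 - j1):
            # Strict rule assumed here: only one-to-one replacements are labeled.
            for i, j in zip(range(i1, i2), range(j1, j2)):
                labels[i].add("REPLACE:" + target[j])
        elif tag == "insert" and i1 < len(source):
            # Attach inserted words to the token that follows them.
            labels[i1].add("INSERT_BEFORE:" + " ".join(target[j1:j2]))
        # Any other edit is skipped, so the affected tokens stay unchanged.
    for token_labels in labels:
        if not token_labels:
            token_labels.add("KEEP")
    return labels


def decode(source, labels):
    """Apply per-token label sets to the source tokens."""
    out = []
    for token, token_labels in zip(source, labels):
        for label in sorted(token_labels):
            if label.startswith("INSERT_BEFORE:"):
                out.extend(label.split(":", 1)[1].split(" "))
        replacements = [l for l in token_labels if l.startswith("REPLACE:")]
        if replacements:
            out.append(replacements[0].split(":", 1)[1])
        elif "DELETE" not in token_labels:
            out.append(token)
    return out


if __name__ == "__main__":
    src = ["He", "go", "to", "the", "school"]
    tgt = ["He", "goes", "to", "school"]
    print(encode(src, tgt))
    print(decode(src, encode(src, tgt)))
```

A token can carry more than one label in this scheme (for example an insertion before a kept word), which is one way the problem becomes multi-label rather than single-label. In a full system, a classifier such as a maximum entropy model would predict these label sets from features of each token.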