Chinese Spelling Error Detection and Correction Based on Language Model, Pronunciation, and Shape

Yu, Junjie; Li, Zhenghua

doi:10.3115/v1/w14-6835

Cited by 79 publications

(69 citation statements)

References 8 publications

“…Most CSC related studies have emerged as a result of a series of shared tasks (Wu et al, 2013; Tseng et al, 2015; Fung et al, 2017; Gaoqi et al, 2018), which involve automatic detection and correction of spelling errors for a given sentence. Earlier work in CSC focus mainly on unsupervised methods such as language model with a pre-constructed confusionset (Yu and Li, 2014). Subsequently, some work cast CSC as a sequential labeling problem, in which conditional random fields (CRF) (Lafferty et al, 2001), gated recurrent networks (Hochreiter and Schmidhuber, 1997; Chung et al, 2014) have been employed to model the problem (Zheng et al, 2016; Xie et al, 2017; Wu et al, 2018).…”

Section: Related Work (mentioning)

confidence: 99%

Confusionset-guided Pointer Networks for Chinese Spelling Check

Wang, Tay, Zhong

2019

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

This paper proposes Confusionset-guided Pointer Networks for Chinese Spell Check (CSC) task. More concretely, our approach utilizes the off-the-shelf confusionset for guiding the character generation. To this end, our novel Seq2Seq model jointly learns to copy a correct character from an input sentence through a pointer network, or generate a character from the confusionset rather than the entire vocabulary. We conduct experiments on three human-annotated datasets, and results demonstrate that our proposed generative model outperforms all competitor models by a large margin of up to 20% F1 score, achieving state-of-the-art performance on three datasets.
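
The confusion-set idea mentioned in the statement and abstract above (limit replacement candidates to characters that are easily confused with the input character, and score them with a language model) can be illustrated with a minimal sketch. This is not the method of Yu and Li (2014) or the pointer network of the citing paper: the confusion set, the bigram counts, and the function names `bigram_logscore`, `sentence_score`, and `correct` are all hypothetical, chosen only to make the example runnable.

```python
from math import log

# Hypothetical toy confusion set: each character maps to characters that are
# similar in pronunciation or shape. Real confusion sets are far larger.
CONFUSION_SET = {
    "在": ["再"],
    "再": ["在"],
}

# Hypothetical bigram counts standing in for a trained language model.
BIGRAM_COUNTS = {
    ("再", "见"): 50,
    ("在", "见"): 1,
    ("我", "在"): 30,
    ("我", "再"): 5,
}


def bigram_logscore(prev, cur, smoothing=0.5):
    """Log of the smoothed bigram count (an unnormalized score, not a probability)."""
    return log(BIGRAM_COUNTS.get((prev, cur), 0) + smoothing)


def sentence_score(chars):
    """Sum of bigram scores over adjacent character pairs."""
    return sum(bigram_logscore(a, b) for a, b in zip(chars, chars[1:]))


def correct(sentence):
    """Greedy left-to-right pass: at each position, either keep the input
    character or replace it with a confusion-set candidate, whichever gives
    the higher sentence score."""
    chars = list(sentence)
    for i, ch in enumerate(chars):
        best, best_score = ch, sentence_score(chars)
        for cand in CONFUSION_SET.get(ch, []):
            trial = chars[:i] + [cand] + chars[i + 1:]
            score = sentence_score(trial)
            if score > best_score:
                best, best_score = cand, score
        chars[i] = best
    return "".join(chars)
```

With these toy counts, `correct("在见")` replaces the first character because the bigram 再见 scores higher than 在见, while `correct("我在")` is left unchanged. Keeping the input character as the default candidate loosely mirrors the "copy or replace from the confusion set" choice described in the abstract, but without any learned model.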

“…Ref [15] proposed that text correction requires two steps of error detection and error correction. Traditional OCR corrections are more often use language models and confusion matrices [6], [16], [17]. However, after the document OCR, the text will be lost due to occlusion, watermark, etc.…”

Section: Related Work (mentioning)

confidence: 99%

A Neural Network Architecture for Information Extraction in Chinese Drug Package Insert

Zhou, Chang, Li

2020

IEEE Access

There is a lot of useful information in the medical photocopying materials. The correct extraction and identification of this information are of great significance for the construction of digital medical. In most previous research, researchers have been working on clinical data, and there is little discussion on the extraction of information from Chinese drug package insert. To settle this issue, a neural network model is proposed in this paper. This model uses OCR's post-document as the data source, which can not only correct these data but also classify sentences. It is mainly composed of three layers: the first layer is employed to correct the data using the language model and the seq2seq model, the second layer is defined by convolution neural network (CNN) aiming to enrich the processed sentences, and another layer is used to determine the label of each sentence. The quantitative experimental results verify the feasibility and validity of the proposed model. In addition, the comparing experiments demonstrate that our method outperforms the regular rule-based approaches, which indicated 4%-6% higher in F1 score.

Index Terms: Chinese medical photocopying, neural network, OCR post correction, seq2seq model, sentence classification, convolutional neural network.

“…Misspelling detection research is very limited, small scale and often on domain specific private data (Zamora et al, 1981). Approaches for misspelling detection primarily involve use of a predefined dictionary of n-grams (Zamora et al, 1981) and words (Dalkiliç and Çebi, 2009; Yu and Li, 2014; Attia et al, 2012). Additionally, dictionaries used are limited to specific languages like Chinese (Yu and Li, 2014), Turkish (Dalkiliç and Çebi, 2009) and Arabic (Attia et al, 2012).…”

Section: Related Work (mentioning)

confidence: 99%

“…Leveraging existing approaches for misspelling detection from product images is beset with a number of challenges. First, although spelling research has intrigued the NLP community for long (Damerau, 1964; Kukich, 1992), misspelling detection research (Zamora et al, 1981; Dalkiliç and Çebi, 2009; Attia et al, 2012; Yu and Li, 2014) is very sparse, language specific and the primary approach has remained a dictionary lookup. This approach does not scale or generalize to billions of product images leading to a large number of false positive detections.…”

Section: Introduction (mentioning)

confidence: 99%

Misspelling Detection from Noisy Product Images

Rao, Shen

2020

Proceedings of the 28th International Conference on Computational Linguistics: Industry Track

Misspellings are introduced on products either due to negligence or as an attempt to deliberately deceive stakeholders. This leads to a revenue loss for online sellers and fosters customer mistrust. Existing spelling research has primarily focused on advancement in misspelling correction and the approach for misspelling detection has remained the use of a large dictionary. The dictionary lookup results in the incorrect detection of several non-dictionary words as misspellings. In this paper, we propose a method to automatically detect misspellings from product images in an attempt to reduce false positive detections. We curate a large scale corpus, define a rich set of features and propose a novel model that leverages importance weighting to account for within class distributional variance. Finally, we experimentally validate this approach on both the curated corpus and an out-of-domain public dataset and show that it leads to a relative improvement of up to 20% in F1 score. The approach thus creates a more robust, generalized deployable solution and reduces reliance on large scale custom dictionaries used today.
