2016
DOI: 10.1007/s11042-016-4185-5

Learning string distance with smoothing for OCR spelling correction

Abstract: Large databases of scanned documents (medical records, legal texts, historical documents) require natural language processing for retrieval and structured information extraction. Errors caused by the optical character recognition (OCR) system increase the ambiguity of the recognized text and decrease the performance of natural language processing. The paper proposes an OCR post-correction system with a parametrized string distance metric. The correction system learns specific error patterns from incorrect words and common sequ…
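The core idea — a string distance whose edit costs are parameters learned from observed OCR error patterns — can be illustrated with a small weighted edit distance. The sketch below is a generic illustration under assumed costs, not the paper's implementation; the function name, cost table, and example confusions are hypothetical.

```python
# Minimal sketch of a parametrized string distance: a weighted edit distance
# whose substitution costs reflect observed OCR error patterns.
# Cost values below are illustrative assumptions, not learned parameters.

def weighted_edit_distance(source, target, sub_cost, ins_cost=1.0, del_cost=1.0):
    """Dynamic-programming edit distance with per-character-pair substitution costs."""
    n, m = len(source), len(target)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + del_cost
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            subst = sub_cost.get((source[i - 1], target[j - 1]),
                                 0.0 if source[i - 1] == target[j - 1] else 1.0)
            d[i][j] = min(d[i - 1][j] + del_cost,       # deletion
                          d[i][j - 1] + ins_cost,       # insertion
                          d[i - 1][j - 1] + subst)      # (mis)match
    return d[n][m]

# Hypothetical learned costs: frequent OCR confusions become cheap to "undo".
learned_costs = {("1", "l"): 0.1, ("0", "o"): 0.1}  # illustrative only
print(weighted_edit_distance("he1lo", "hello", learned_costs))  # 0.1
```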

Cited by 13 publications (7 citation statements)
References 38 publications (47 reference statements)
“…In particular, the similarity between crawled pages and the given input text is controlled based on the normalised cosine distance. For each token in the input, they select lexical words whose LV distances to the input are small. Poncelas et al [124], Hládek et al [63], and Généreux et al [52] employ similar techniques to detect errors and suggest correction candidates. Specifically, they detect noisy tokens by a lexicon lookup and select candidates based on LV distances between a given error and lexicon entries.…”
Section: Context-dependent Approaches
confidence: 99%
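A minimal sketch of the detection-and-suggestion step described above: a token missing from the lexicon is flagged as an error, and correction candidates are lexicon entries within a small Levenshtein (LV) distance. The lexicon contents, the `max_distance` threshold, and the function names are illustrative assumptions, not the cited systems' implementations.

```python
# Lexicon-lookup error detection and LV-distance candidate generation (sketch).

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def detect_and_suggest(token, lexicon, max_distance=2):
    if token in lexicon:
        return []  # token looks valid; nothing to correct
    return sorted((w for w in lexicon if levenshtein(token, w) <= max_distance),
                  key=lambda w: levenshtein(token, w))

lexicon = {"record", "report", "reward"}          # toy lexicon
print(detect_and_suggest("rccord", lexicon))      # ['record']
```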
“…Poncelas et al [124] rank the correction suggestions with a word 5-gram language model built from the Europarl-v9 corpus. Hládek et al [63] use an HMM in which the state transition probability is a word-bigram language-model probability and the observation probability is their smoothed string distance, and choose the best candidate with it. Généreux et al [52] choose the most probable candidate by summing the following feature values: confusion weight, candidate frequency, and bigram frequency.…”
Section: Context-dependent Approaches
confidence: 99%
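The HMM-based selection attributed to Hládek et al [63] can be sketched as a Viterbi search in which hidden states are correction candidates, transitions are scored by a word-bigram language model, and emissions by a string-distance match against the observed OCR token. The probabilities and helper functions below are toy assumptions for illustration, not values or code from the paper.

```python
# Viterbi-style candidate selection with bigram-LM transitions and
# string-distance emissions (sketch with toy scores).
import math

def viterbi(ocr_tokens, candidates, bigram_logprob, emission_logprob):
    """candidates[i] is the list of correction candidates for ocr_tokens[i]."""
    best = {w: (emission_logprob(ocr_tokens[0], w), [w]) for w in candidates[0]}
    for t in range(1, len(ocr_tokens)):
        new_best = {}
        for w in candidates[t]:
            emit = emission_logprob(ocr_tokens[t], w)
            score, path = max(
                ((s + bigram_logprob(prev, w) + emit, p + [w])
                 for prev, (s, p) in best.items()),
                key=lambda x: x[0])
            new_best[w] = (score, path)
        best = new_best
    return max(best.values(), key=lambda x: x[0])[1]

# Toy models (assumptions): one boosted bigram, and an emission score that
# prefers candidates close to the noisy OCR token.
def bigram_logprob(prev, word):
    return math.log(0.5 if (prev, word) == ("medical", "record") else 0.1)

def emission_logprob(token, word):
    return -sum(a != b for a, b in zip(token, word)) - abs(len(token) - len(word))

print(viterbi(["medical", "rccord"],
              [["medical"], ["record", "reward"]],
              bigram_logprob, emission_logprob))  # ['medical', 'record']
```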
“…Spelling correction is part of the post-processing of the digitized document, because OCR systems are usually proprietary and difficult to adapt. Typical error patterns appear in OCR texts [8]. The standard dataset for evaluating an OCR spelling correction system is the TREC-5 Confusion Track [9].…”
Section: Spelling Errors
confidence: 99%
“…If the training corpus is sparse (which it almost always is), the learning process introduces the problem of overfitting. Hládek et al [8] proposed a method for smoothing the parameters in a letter-confusion matrix. Bilenko and Mooney [149] extended string-distance learning with an affine gap penalty (allowing random sequences of characters to be skipped).…”
Section: Learning String Metrics
confidence: 99%
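One simple way to keep unseen character pairs from receiving zero probability in a sparsely trained letter-confusion matrix is additive (Laplace) smoothing, sketched below. This stands in for the general idea only; it is not the specific smoothing scheme proposed by Hládek et al [8], and the alphabet and training pairs are made up.

```python
# Additive smoothing of a letter-confusion matrix (sketch).
from collections import Counter

def smoothed_confusion_probs(aligned_pairs, alphabet, alpha=1.0):
    """aligned_pairs: iterable of (ocr_char, correct_char) from aligned training words."""
    aligned_pairs = list(aligned_pairs)
    counts = Counter(aligned_pairs)
    totals = Counter(ocr for ocr, _ in aligned_pairs)
    probs = {}
    for ocr in alphabet:
        denom = totals[ocr] + alpha * len(alphabet)
        for correct in alphabet:
            probs[(ocr, correct)] = (counts[(ocr, correct)] + alpha) / denom
    return probs

alphabet = list("abcdefghijklmnopqrstuvwxyz01")   # toy alphabet
training = [("1", "l"), ("1", "l"), ("0", "o"), ("c", "c"), ("c", "e")]
P = smoothed_confusion_probs(training, alphabet)
print(round(P[("1", "l")], 3))   # seen pair: relatively high probability
print(round(P[("1", "x")], 3))   # unseen pair: small but non-zero
```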