Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Lang 2003
DOI: 10.3115/1073445.1073463

A generative probabilistic OCR model for NLP applications

Abstract: In this paper, we introduce a generative probabilistic optical character recognition (OCR) model that describes an end-to-end process in the noisy channel framework, progressing from generation of true text through its transformation into the noisy output of an OCR system. The model is designed for use in error correction, with a focus on post-processing the output of black-box OCR systems in order to make it more useful for NLP tasks. We present an implementation of the model based on finite-state models, demo…

Cited by 38 publications (22 citation statements)
References 21 publications

“…We are lucky that in our language model, the French preposition à (English: to) is slightly more probable than the French verb a (English: has); otherwise, we would encounter dozens of additional miscorrections. Word deletions are relatively rare in the evaluation set, but pose a yet unsolved problem to our merging algorithm. In 8 cases, à is regrettably deleted by OmniPage.…”
Section: First Evaluation of the OCR Merging (mentioning)
confidence: 99%
“…statistical approaches as described in [10] and [7], as well as lexical approaches as in [13]. As mentioned before, some of our experiments were inspired by Reynaert [10] who worked on cleaning a digitized collection of historical Dutch newspapers.…”
Section: Related Work (mentioning)
confidence: 99%
“…More similar to the system, by considering an end-to-end generative model, is the one of Kolak et al [68]. They use at run-time a single transducer that takes a sequence of OCR characters as input, and returns a lattice of all possible sequences of real words as output, along with their weights.…”
Section: Context of OCR Correction (mentioning)
confidence: 99%
“…Several extensions may be formulated: similar to Kolak et al [68], higher-level syntactic information may be built through additional FSMs to handle word segmentation problems for example, or to choose the best output regarding the whole sentence as well. Syntactic information, modelled by an additional machine, with grammatical forms, could be also used to correct real-word errors!…”
Section: Conclusion on Recognition-by-Correction (mentioning)
confidence: 99%
“…Kukich surveyed various methods to correct words, either in isolation or with context, using natural language processing techniques [13]. Kolak developed a generative model to estimate the true word sequence from noisy OCR output [12]. They assume a generative process that produces words, characters, and word boundaries, in order to model segmentation and character recognition errors of an OCR system.…”
Section: Related Work (mentioning)
confidence: 99%