OCR Post-processing Using Weighted Finite-State Transducers

Llobet, Rafael; Cerdan-Navarro, Jose-Ramon; Pérez-Cortés, Juan-Carlos; Arlandis, Joaquim

doi:10.1109/icpr.2010.498

Cited by 34 publications

(26 citation statements)

References 11 publications

“…To propagate the outputs in the beam efficiently, a number of efficient structures have been devised to compactly encode certain families of distributions [21,17]. Efficient encodings for top-k lists can improve the scalability of our approach as well.…”

Section: Related Work (mentioning)

confidence: 99%

Beyond myopic inference in big data pipelines

Raman

Swaminathan

Gehrke

et al. 2013

Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Big Data Pipelines decompose complex analyses of large data sets into a series of simpler tasks, with independently tuned components for each task. This modular setup allows re-use of components across several different pipelines. However, the interaction of independently tuned pipeline components yields poor end-to-end performance as errors introduced by one component cascade through the whole pipeline, affecting overall accuracy. We propose a novel model for reasoning across components of Big Data Pipelines in a probabilistically well-founded manner. Our key idea is to view the interaction of components as dependencies on an underlying graphical model. Different message passing schemes on this graphical model provide various inference algorithms to trade off end-to-end performance and computational cost. We instantiate our framework with an efficient beam search algorithm, and demonstrate its efficiency on two Big Data Pipelines: parsing and relation extraction.

“…Although the requirements are very different, most basic techniques used in that field can be applied to OCR tasks with little modification. Thus, several works use language modeling techniques for error-correcting applied to OCR and text recognition tasks, either on constrained or unconstrained environments Hull & Srihari (1982); Tong & Evans (1996); Perez-Cortes et al (2000); Kolak & Resnik (2005); Llobet et al (2010). Confidence measures reflecting the likelihood that a given OCR hypothesis belongs to the model are provided by many of them.…”

Section: Large-scale OCR Systems and Post-processing (mentioning)

confidence: 99%

“…A technique based on Weighted Finite-State Transducers (WFSTs) combining language, hypothesis and error models has been used to post-process the OCR hypotheses Llobet et al (2010). It is based on a finite-state transducer built from a formal grammar that encodes the strings in the lexicon or language sample.…”

Section: Post-processing Algorithm and Language Models Used (mentioning)

confidence: 99%

Batch-adaptive rejection threshold estimation with application to OCR post-processing

Navarro-Cerdan

Arlandis

Llobet

et al. 2015

Expert Systems with Applications

Self Cite

An OCR process is often followed by the application of a language model to find the best transformation of an OCR hypothesis into a string compatible with the constraints of the document, field or item under consideration. The cost of this transformation can be taken as a confidence value and compared to a threshold to decide if a string is accepted as correct or rejected in order to satisfy the need for bounding the error rate of the system. Widespread tools like ROC, precision-recall, or error-reject curves are commonly used along with fixed thresholding in order to achieve that goal. However, those methodologies fail when a test sample has a confidence distribution that differs from the one of the sample used to train the system, which is a very frequent case in post-processed OCR strings (e.g., string batches showing particularly careful handwriting styles in contrast to free styles). In this paper, we propose an adaptive method for the automatic estimation of the rejection threshold that overcomes this drawback, allowing the operator to define an expected error rate within the set of accepted (non-rejected) strings of a complete batch of documents (as opposed to trying to establish or control the probability of error of a single string), regardless of its confidence distribution. The operator (expert) is assumed to know the error rate that can be acceptable to the user of the resulting data. The proposed system transforms that knowledge into a suitable rejection threshold. The approach is based on the estimation of an expected error vs. transformation cost distribution. First, a model predicting the probability of a cost to arise from an erroneously transcribed string is computed from a sample of supervised OCR hypotheses. Then, given a test sample, a cumulative error vs. cost curve is computed and used to automatically set the appropriate threshold that meets the user-defined error rate on the overall sample. The results of experiments on batches coming from different writing styles show very accurate error rate estimations where fixed thresholding clearly fails. An original procedure to generate distorted strings from a given language is also proposed and tested, which allows the use of the presented method in tasks where no real supervised OCR hypotheses are available to train the system.
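
The two steps described in this abstract (an error-given-cost model, then a cumulative error vs. cost curve over the batch) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the histogram-based error model, the pessimistic fallback for empty bins, and the names `fit_error_model` and `estimate_threshold` are all assumptions made here for illustration.

```python
from bisect import bisect_right


def fit_error_model(costs, is_error, bin_edges):
    """Estimate P(error | cost bin) from supervised (cost, is_error) pairs.

    Hypothetical helper: a plain histogram stands in for whatever model
    the paper actually uses.
    """
    n_bins = len(bin_edges) + 1
    errors = [0] * n_bins
    totals = [0] * n_bins
    for cost, err in zip(costs, is_error):
        b = bisect_right(bin_edges, cost)
        totals[b] += 1
        errors[b] += int(err)
    # Assumption: bins with no training data are treated as always wrong.
    return [errors[b] / totals[b] if totals[b] else 1.0 for b in range(n_bins)]


def estimate_threshold(batch_costs, bin_edges, p_error, target_error_rate):
    """Return the largest cost threshold for which the expected error rate
    among accepted strings (cost <= threshold) stays within the target,
    or None if no threshold qualifies."""
    costs = sorted(batch_costs)
    threshold = None
    expected_errors = 0.0
    for k, cost in enumerate(costs, start=1):
        expected_errors += p_error[bisect_right(bin_edges, cost)]
        # Only evaluate after the last string of a group of equal costs,
        # since a threshold accepts all of them together.
        last_of_ties = k == len(costs) or costs[k] != cost
        if last_of_ties and expected_errors / k <= target_error_rate:
            threshold = cost
    return threshold


if __name__ == "__main__":
    edges = [1.0, 2.0]
    model = fit_error_model(
        [0.1, 0.2, 0.3, 1.1, 1.2, 1.3, 2.1, 2.2],
        [0, 0, 0, 0, 1, 0, 1, 1],
        edges,
    )
    print(estimate_threshold([0.5, 0.2, 1.5, 2.5], edges, model, 0.2))
```

Because the threshold is recomputed from the costs of each batch, two batches with different cost distributions can receive different thresholds for the same target error rate, which is the behaviour the abstract contrasts with fixed thresholding.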

“…This decreases the time needed for manual post-correction since correct words do not have to be considered as candidates for correction by the human corrector. Llobet et al (2010) combine information from the OCR system output, the error distribution and the language as weighted finite-state transducers. Reffle and Ringlstetter (2013) use global as well as local error information to be able to fine-tune post-correction systems to historical documents.…”

Section: Related Work (mentioning)

confidence: 99%

Multi-modular domain-tailored OCR post-correction

Schulz

Kuhn

2017

Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing

One of the main obstacles for many Digital Humanities projects is the low data availability. Texts have to be digitized in an expensive and time-consuming process, in which Optical Character Recognition (OCR) post-correction is one of the time-critical factors. Using OCR post-correction as an example, we show the adaptation of a generic system to solve a specific problem with little data. The system accounts for a diversity of errors encountered in OCRed texts coming from different time periods in the domain of literature. We show that the combination of different approaches, such as Statistical Machine Translation and spell checking, with the help of a ranking mechanism tremendously improves over single-handed approaches. Since we consider the accessibility of the resulting tool as a crucial part of Digital Humanities collaborations, we describe the workflow we suggest for efficient text recognition and subsequent automatic and manual post-correction.
