A synthetic document image dataset for developing and evaluating historical document processing methods

Walker, Daniel D.; Lund, William B.; Ringger, Eric K.

doi:10.1117/12.912203

Cited by 3 publications

(3 citation statements)

References 8 publications

“…Many different degradation effects can be used as defocusing, paper positioning variations, distortion of character strokes, non-uniform illumination, typesetting imperfections, perspective distortion, etc. [22,20,6,25,21,23,26]. These degradation models aim at generating synthetic noise that can be found in the real world and therefore to extend training sets to perform better on unseen scenarios.…”

Section: Noising Methods (mentioning)

confidence: 99%

The NoisyOffice Database: A Corpus To Train Supervised Machine Learning Filters For Image Processing

Castro-Bleda

España-Boquera

Pastor-Pellicer

et al. 2019

The Computer Journal

This paper presents the ‘NoisyOffice’ database. It consists of images of printed text documents with noise mainly caused by uncleanliness from a generic office, such as coffee stains and footprints on documents or folded and wrinkled sheets with degraded printed text. This corpus is intended to train and evaluate supervised learning methods for cleaning, binarization and enhancement of noisy images of grayscale text documents. As an example, several experiments of image enhancement and binarization are presented by using deep learning techniques. Double-resolution images are also provided for testing super-resolution methods. The corpus is freely available at UCI Machine Learning Repository. Finally, a challenge organized by Kaggle Inc. to denoise images, using the database, is described in order to show its suitability for benchmarking of image processing systems.

“…Four datasets were used in this work: two test sets, the Eisenhower Communiqués [30] and the Nineteenth Century Mormon Article Newspaper Index [31]; and two training sets, an extraction of the 2001 Topic Annotated Enron Email Data Set and an extraction of the Reuters-21578 Text Categorization Test Collection [32,33]. The following sections describe each dataset and how it was created.…”

Section: Corpora (mentioning)

confidence: 99%

“…For further details on the process, please consult the paper by Walker, Lund, and Ringger (2013) [33].…”

Section: Synthetic Training Sets (mentioning)

confidence: 99%

How well does multiple OCR error correction generalize?

2013

As the digitization of historical documents, such as newspapers, becomes more common, archive patrons increasingly need accurate digital text from those documents. Building on our earlier work, this paper makes three contributions: (1) demonstrating the applicability of novel methods for correcting optical character recognition (OCR) errors on disparate data sets, including a new synthetic training set; (2) enhancing the correction algorithm with novel features; and (3) assessing the data requirements of the correction learning method. First, we correct errors using conditional random fields (CRF) trained on synthetic training data sets in order to demonstrate the applicability of the methodology to unrelated test sets. Second, we show the strength of lexical features from the training sets on two unrelated test sets, yielding a relative reduction in word error rate (WER) on the test sets of 6.52%. New features capture the recurrence of hypothesis tokens and yield an additional relative reduction in WER of 2.30%. Further, we show that only 2.0% of the full training corpus of over 500,000 feature cases is needed to achieve correction results comparable to those using the entire training corpus, effectively reducing both the complexity of the training process and the learned correction model.
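
The abstract above reports its gains as relative reductions in word error rate. As a rough illustration of what such a figure means, the sketch below computes WER as word-level edit distance divided by reference length, and the relative reduction as the drop in WER divided by the baseline WER. The function names and the sample strings are hypothetical and chosen only for this example; the cited paper's own scoring procedure may differ in tokenization and alignment details.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level Levenshtein distance / number of reference words.

    Assumes whitespace tokenization (an illustrative simplification).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, corrected_wer: float) -> float:
    """Fraction of the baseline error removed by the correction step."""
    if baseline_wer == 0:
        raise ValueError("baseline WER is zero; relative reduction is undefined")
    return (baseline_wer - corrected_wer) / baseline_wer


# Hypothetical strings, not data from the cited paper.
truth = "the quick brown fox"
raw_ocr = "the quikc brown fax"      # two wrong words out of four
corrected = "the quick brown fax"    # one wrong word out of four

baseline = word_error_rate(truth, raw_ocr)      # 0.5
improved = word_error_rate(truth, corrected)    # 0.25
print(relative_reduction(baseline, improved))   # 0.5, i.e. a 50% relative reduction
```

Under this definition, a relative reduction of 6.52% means the corrected output removes 6.52% of the errors present in the baseline, not that the WER itself falls by 6.52 percentage points.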