2012
DOI: 10.1117/12.912203

A synthetic document image dataset for developing and evaluating historical document processing methods

Abstract: Document images accompanied by OCR output text and ground truth transcriptions are useful for developing and evaluating document recognition and processing methods, especially for historical document images. Additionally, research into improving the performance of such methods often requires further annotation of training and test data (e.g., topical document labels). However, transcribing and labeling historical documents is expensive. As a result, existing real-world document image datasets with such accompa…

Cited by 3 publications
(3 citation statements)
References 8 publications
“…Many different degradation effects can be used as defocusing, paper positioning variations, distortion of character strokes, non-uniform illumination, typesetting imperfections, perspective distortion, etc. [22,20,6,25,21,23,26]. These degradation models aim at generating synthetic noise that can be found in the real world and therefore to extend training sets to perform better on unseen scenarios.…”
Section: Noising Methods (mentioning)
confidence: 99%
“…Four datasets were used in this work: two test sets, the Eisenhower Communiqués 30 and the Nineteenth Century Mormon Article Newspaper Index; 31 and two training sets, an extraction of the 2001 Topic Annotated Enron Email Data Set and an extraction of the Reuters-21578 Text Categorization Test Collection. 32,33 The following sections describe each dataset and how it was created.…”
Section: Corpora (mentioning)
confidence: 99%
“…For further details on the process, please consult the paper by Walker, Lund, and Ringger (2013). 33…”
Section: Synthetic Training Sets (mentioning)
confidence: 99%