2017
DOI: 10.3390/jimaging3040062
DocCreator: A New Software for Creating Synthetic Ground-Truthed Document Images

Abstract: Most digital libraries that provide user-friendly interfaces, enabling quick and intuitive access to their resources, are based on Document Image Analysis and Recognition (DIAR) methods. Such DIAR methods need ground-truthed document images to be evaluated/compared and, in some cases, trained. Especially with the advent of deep learning-based approaches, the required size of annotated document datasets seems to be ever-growing. Manually annotating real documents has many drawbacks, which often leads to small r…

Cited by 50 publications (40 citation statements)
References 53 publications
“…More specifically, we extracted raw text from the test set and converted it into images. These images have been contaminated used the DocCreator tool 1 developed by Journet et al [7]. The tool allowed to add four common types of OCR degradation related to storage conditions or poor quality of printing materials that may be present on digital libraries: Character degradation adds small ink spots on characters in order to simulate degradation due to the age of the document or the use of an incorrectly set scanner.…”
Section: Methodology and Results (mentioning)
confidence: 99%
“…We have used the DocCreator tool [12] to add four common types of OCR degradation that may be present on digital libraries material:…”
Section: Simulated Data Sets (mentioning)
confidence: 99%
“…Each @ in the ground truth indicates the insertion of one character by the OCR while @ in the noisy text indicates that one character has been deleted from the original text. All data used in this work are available for public 12 . In order to evaluate the quality of the text generated by an OCR engine, one of the most popular measures is Character Error Rate (CER) [11].…”
Section: Simulated Data Sets (mentioning)
confidence: 99%
“…However, in some cases, cursive fonts may not be available and the image structure can be hard to capture when there are a lot of variations in the dataset. Similarly, Journet et al [11] generate synthetic documents by extracting layout and characters using the Tesseract OCR. Then, the image is inpainted to recover the background image.…”
Section: Related Work (mentioning)
confidence: 99%