2019
DOI: 10.3390/app9224853

OCR4all—An Open-Source Tool Providing a (Semi-)Automatic OCR Workflow for Historical Printings

Abstract: Optical Character Recognition (OCR) on historical printings is a challenging task mainly due to the complexity of the layout and the highly variant typography. Nevertheless, in the last few years great progress has been made in the area of historical OCR, resulting in several powerful open-source tools for preprocessing, layout recognition and segmentation, character recognition and post-processing. The drawback of these tools often is their limited applicability by non-technical users like humanist scholars a…

Cited by 31 publications (7 citation statements)
References 37 publications
“…In addition to papers improving specific processes involved in OCR, some previous papers present a combination of various OCR techniques for overall improved process accuracy. OCR4all [109] is an open-source OCR software that combines state-of-the-art OCR components and continuous model training into a comprehensive workflow for processing historical printings and provides a user-friendly GUI and extensive configuration capabilities. The software outperforms commercial tools on moderate layouts.…”
Section: Discussion (mentioning)
confidence: 99%
“…• Outdated datasets [27]: Outdated or obsolete datasets are another challenge to the existing datasets as most of these datasets contain document structure with an old format/layout. The datasets like RVL-CDIP contain document images with blur, obsolete format, and missing data values, creating problems during the information extraction process.…”
Section: Poor Quality (mentioning)
confidence: 99%
“…Data augmentation with artificial GT has been tested, without significant impact. Models have been produced with two different engines offering accessible user interfaces for non-specialists: Kraken [Kiessling 2019]/eScriptorium and Calamari [Wick et al 2018]/OCR4all [Reul et al 2019]. It has to be noted that scores are not strictly comparable (setups and evaluations are different), but both show extremely good results on the in-domain test set.…”
Section: OCR (mentioning)
confidence: 99%