A comprehensive evaluation methodology for noisy historical document recognition techniques

Stamatopoulos, Nikolaos; Louloudis, Georgios; Gatos, Basilis

doi:10.1145/1568296.1568306

Cited by 8 publications

(5 citation statements)

References 21 publications

“…In black pixel projection histogram (N. Stamatopoulos and Gatos, 2009), characters are separated by cutting at turning points of the cross direction histogram. Figure 4 shows the black pixel projection histogram of a Kanji character.…”

Section: Black Pixel Projection Histogram

mentioning

confidence: 99%
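
The projection-histogram idea in the statement above can be illustrated with a minimal sketch. It assumes a binary image stored as a list of rows with 1 for black pixels, and it simplifies "cutting at turning points" to cutting wherever the column profile drops to an empty gap. The function names `projection_profile` and `cut_at_gaps` are hypothetical and are not taken from the cited papers.

```python
def projection_profile(image, axis="columns"):
    """Count black pixels (value 1) along each column or each row."""
    if axis == "columns":
        # zip(*image) transposes the list of rows into columns.
        return [sum(column) for column in zip(*image)]
    return [sum(row) for row in image]


def cut_at_gaps(profile, threshold=0):
    """Return (start, end) pairs, end exclusive, where profile > threshold.

    Simplifying assumption: a cut is placed wherever the profile falls to
    the threshold or below, rather than at general turning points.
    """
    segments = []
    start = None
    for index, count in enumerate(profile):
        if count > threshold and start is None:
            start = index                      # a new inked run begins
        elif count <= threshold and start is not None:
            segments.append((start, index))    # the run ended at a gap
            start = None
    if start is not None:
        segments.append((start, len(profile)))  # run reaches the image edge
    return segments


# Toy 4x7 binary image: two blobs separated by two empty columns.
toy = [
    [1, 1, 0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1, 0, 1],
    [1, 1, 0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1, 1, 1],
]
profile = projection_profile(toy)
segments = cut_at_gaps(profile)
```

On real scans the threshold would usually be raised above zero so that specks of noise in a gap do not merge neighbouring characters; that choice is an assumption of this sketch, not something stated in the source.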

“…As far as we know, any ruby removal methods for early-modern Japanese printed books have not been studied. As for existing methods to remove ruby characters from current books with standard typography, there are two main methods (N. Stamatopoulos and Gatos, 2009) (Fletcher and Kasturi, 1988): (1) Separating ruby characters linearly using density histogram and (2) separating ruby characters using circumscription rectangles. Both methods assume the standard typography for the target books.…”

Section: Introduction

mentioning

confidence: 99%

A Multi-fonts Kanji Character Recognition Method for Early-modern Japanese Printed Books with Ruby Characters

Awazu, Fukuo, Takata, et al. 2014

Proceedings of the 3rd International Conference on Pattern Recognition Applications and Methods

The website of the National Diet Library in Japan makes many early-modern (AD 1868-1945) Japanese printed books available to the public, but full-text search is essentially impossible. To support advanced search of historical literature, the page images must be converted to text automatically. However, the ruby system, which is peculiar to Japanese books, poses a serious obstacle to this textualization. When existing OCR systems are applied to early-modern Japanese printed books, the recognition rate is extremely low. To solve this problem, we have already proposed a multi-font Kanji character recognition method using the PDC feature and an SVM. In this paper, we propose a ruby character removal method for early-modern Japanese printed books using genetic programming, and evaluate our multi-font Kanji character recognition method with 1,000 types of early-modern Japanese printed Kanji characters.

“…Even in relatively recent (e.g. early twentieth century) documents, typography, printing and language can differ widely from modern usage [3].…”

Section: Introduction: the Problem

mentioning

confidence: 99%

Ocropodium: open source OCR for small-scale historical archives

Blanke, Bryant, Hedges 2011

Journal of Information Science

Large-scale digitization projects dealing with text-based historical material face challenges that are not well catered for by commercial software. This article discusses the results of a project to build a scalable OCR workflow for historical collections based on open source tools that is particularly tailored towards use in small-scale historical archives. It argues that open source tools allow for better customization to match these requirements, particularly with regard to character model training and per-project language modelling. We offer insights into our accuracy evaluation results of various open source OCR tools, as well as a case study about the challenges and opportunities of open source OCR in historical archives.

“…Other tools are however oriented to evaluate the accuracy in the interpretation of the content (the printed characters and words). This type of evaluation compares the output with a reference which contains a highly accurate transcription (or ground truth) of the source document [12]. The creation of such ground truth is expensive but usually a limited size one is enough to obtain significant numbers provided that it contains a representative sample of the collection.…”

mentioning

confidence: 99%
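
Comparing OCR output with a ground-truth transcription, as the statement above describes, is commonly summarized as a character error rate. The following is a minimal sketch of that calculation using Levenshtein edit distance. It is an illustration only, not the algorithm of any tool cited on this page, and the function names `edit_distance` and `character_error_rate` are hypothetical.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]  # deleting i reference characters yields an empty string
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the length of the ground-truth reference."""
    if not reference:
        raise ValueError("reference (ground truth) must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# Made-up example strings: one substituted character out of four.
cer = character_error_rate("abcd", "abxd")
```

Because the rate is normalized by the reference length, it can exceed 1.0 when the OCR output contains many spurious insertions.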

An open-source OCR evaluation tool

Carrasco 2014

Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage

This paper describes an open-source tool which computes statistics of the differences between a reference text and the output of an OCR engine. It also facilitates the spotting of mismatches by generating an aligned bitext where the differences are highlighted and cross-linked. The tool accepts a variety of input formats (both for the reference and the OCR) and can also be used to compare the output of two different OCR engines. Some considerations on the criteria to compare the textual content of two files, at character and word level, are also discussed here.
