Large-Scale Optical Character Recognition of Ancient Greek

Robertson, Bruce; Boschetti, Federico

doi:10.3138/mous.14.3-3

Cited by 8 publications

(6 citation statements)

References 10 publications


“…OCR techniques have been applied on PG collection by Perseus Digital Library (PDL) [5] and the raw textual data [6] is offered to researchers with an interest in Ancient Greek and Latin for further study. OCR applications are also examined in the works of Robertson et al (2014) and Robertson and Boschetti (2017), while an OCR tool [7] is introduced by Robertson. Google has also applied OCR (Fujii, 2018) on the published volumes and its search engine offers links to some of the PG's terms.…”

Section: Optical Character Recognition and Word Spotting Techniques (mentioning)

confidence: 99%

Semantic enrichment on large scanned collections through their “satellite texts”: the paradigm of Migne’s Patrologia Graeca

Varthis, Tzanavaris, Giarenis, et al. 2021. IDD


Purpose: This paper aims to present a methodology for the semantic enrichment on the scanned collection of Migne’s Patrologia Graeca (PG), attempting to easily locate on the Web domain the scanned PG source, when a reference of this source is described and commented on another scanned or textual document, and to semantically enrich PG through related scanned or textual documents named “satellite texts” published by third people. The present enrichment of PG uses as satellite texts the Dorotheos Scholarios's Synoptic Index (DSSI) which act as metadata for PG.

Design/methodology/approach: The methodology consists of two parts. The first part addresses the DSSI transcription via a proper web tool. The second part is divided into two subsections: the accomplishment of interlinking the printed column numbers of each scanned PG page with its actual filename, which is the build of a matching function, and the build of a web interface for PG, based on the generated Uniform Resource Identifiers (URIs) of the above first subsection.

Findings: The result of the implemented methodology is a Web portal, capable of providing server-less search of topics with direct (single click) navigation to sources. The produced system is static, scalable, easy to be managed and requires minimal cost to be completed and maintained. The produced data sets of transcribed DSSI and the JavaScript Object Notation (JSON) matching functions are available for personal use of students and scholars under Creative Commons license (CC-BY-NC-SA).

Social implications: Scholars or anyone interested in a particular subject can easily locate topics in PG and reference them, using URIs that are easy to remember. This fact contributes significantly to the related scientific dialogue.

Originality/value: The methodology uses the transcribed satellite texts of DSSI, which act as metadata for PG, to semantically enrich PG collection. Furthermore, the built PG Web interface can be used by other satellite texts as a reference basis to further enrich PG, as it provides a direct identification of sources. The presented methodology is general and can be applied to any scanned collection using its own satellite texts.


“…Another open source OCR software that was adapted to work on polytonic Greek is Gamera. A substantially extended version of Gamera became the basis for Rigaudon [16], a processing pipeline developed for large scale OCR that employs image pre-processing and OCR post-processing to improve the accuracy of recognition, leading to an average character accuracy of 96%. Finally, unlike the majority of previously discussed OCR systems, which require character segmentation, Katsouros et al. [11] have developed a segmentation-free method that uses Hidden Markov Models (HMMs) to OCR text lines, reaching an average character accuracy of 92.39% on the GRPOLY-DB dataset [8].…”

Section: Related Work (mentioning)

confidence: 99%

“…We therefore computed character error rates (CER) by using coordinates-based alignment of words, which allowed for a fast, region-based evaluation without need for carefully aligned text and line-images. For comparison with other research on historical OCR and on the impact of OCR quality on downstream NLP tasks, we report scores according to two additional metrics: PRImA TextEval's [3] bag-of-words F1-score and normalized Levenshtein distance between ground truth and OCR output.…”

Section: Evaluation Setting (mentioning)

confidence: 99%

Optical Character Recognition of 19th Century Classical Commentaries: the Current State of Affairs

Romanello, Najem-Meyer, Robertson. 2021. Preprint. Self-citation


Together with critical editions and translations, commentaries are one of the main genres of publication in literary and textual scholarship, and have a century-long tradition. Yet, the exploitation of thousands of digitized historical commentaries was hitherto hindered by the poor quality of Optical Character Recognition (OCR), especially on commentaries to Greek texts. In this paper, we evaluate the performance of two pipelines suitable for the OCR of historical classical commentaries. Our results show that Kraken + Ciaconna reaches a substantially lower character error rate (CER) than Tesseract/OCR-D on commentary sections with high density of polytonic Greek text (average CER 7% vs. 13%), while Tesseract/OCR-D is slightly more accurate than Kraken + Ciaconna on text sections written predominantly in Latin script (average CER 8.2% vs. 8.4%). As part of this paper, we also release GT4HistComment, a small dataset with OCR ground truth for 19th-century classical commentaries, and Pogretra, a large collection of training data and pre-trained models for a wide variety of ancient Greek typefaces.


“…These diacritics help the reader on how the words are pronounced and emphasized. These diacritics are the acute, grave and circumflex accent, smooth and rough breathing, the subscript, the diaeresis and are shown briefly in Table 1. This particularity creates additional difficulties, mainly at the level of character segmentation and consequently in their recognition with traditional OCR systems [37][17][10]. Moreover, the text degradation such as stained paper, faded ink, connected characters as well as the scanning process that introduces skewing, low contrast, warping effects, etc.…”

Section: Related Work For Word-spotting and Particularities Of Patrologia Graeca (mentioning)

confidence: 99%

“…PG has been digitized and is available on the Web by the large scale digitization project of Google [52][46] and others [36][51][53]. However, the access of PG for semantic navigation or simple searching is very limited.…”

Section: Introduction (mentioning)

confidence: 99%

Automatic metadata extraction via image processing using Migne's Patrologia Graeca

Varthis, Poulos, Giarenis, et al. 2020. IJMSO


A wealth of knowledge is kept in libraries and cultural institutions in various digital forms, without however the possibility of a simple term search, let alone of a substantial semantic search. One such important collection, which contains knowledge accumulated over the ages and remains inaccessible for the greater part, is Patrologia Graeca. So far, little research has been conducted to make this digital collection searchable to a certain degree, in order to retrieve and reveal its gathered knowledge in an efficient way. In this study, a novel approach is proposed which strives towards recognizing words from large printed corpora such as Patrologia Graeca. The proposed framework first applies an efficient segmentation process at word level and transforms the word-images of Greek polytonic script of the Patrologia Graeca into special compact shapes. Afterwards, the contours of these shapes are extracted and compared with the contour of a similarly transformed query word-image in order to locate the specific word in the digitized documents. For the comparison, we use a series of three descriptors: Hu's invariant moments for discarding unlikely matches, Shape Context for the contour similarity, and Pearson's correlation coefficient for final pruning of the dissimilar words and additional verification. Comparative results are presented by using, instead of Pearson's correlation coefficient, the Long Short-Term Memory neural network engine of the Tesseract Optical Character Recognition system. The described framework, owing to its simplicity and efficiency, can be applied for massive creation of search indexes and consequently semantic enrichment of Patrologia Graeca. The framework has the potential to be applicable to other printed collections with proper configuration of the parameters.
An additional and very significant consequence of our method's effectiveness and simplicity is that it can be used as a pre-stage to provide a large number of word-image and label pairs. These pairs can be used for training neural networks or common classifiers such as k-nearest neighbor or support vector machine.
