2017
DOI: 10.3138/mous.14.3-3

Large-Scale Optical Character Recognition of Ancient Greek

Abstract: This paper documents our campaign to undertake the large-scale optical character recognition of ancient, or polytonic, Greek. Building upon the Gamera OCR engine and developing a suite of post-processing tools, including automatic spellcheck, we processed 1,200 volumes comprising 329,002,271 Greek words. A sample of 10 pages is studied in detail; they demonstrate the degree to which each step of post-processing improved the results, and with which source documents. These pages attain an average character accur…

Cited by 8 publications (6 citation statements)
References 10 publications
“…OCR techniques have been applied on PG collection by Perseus Digital Library (PDL) [5] and the raw textual data [6] is offered to researchers with an interest in Ancient Greek and Latin for further study. OCR applications are also examined in the works of Robertson et al (2014) and Robertson and Boschetti (2017), while an OCR tool [7] is introduced by Robertson. Google has also applied OCR (Fujii, 2018) on the published volumes and its search engine offers links to some of the PG's terms.…”
Section: Optical Character Recognition and Word Spotting Techniques (mentioning)
confidence: 99%
“…Another open source OCR software that was adapted to work on polytonic Greek is Gamera. A substantially extended version of Gamera became the basis for Rigaudon [16], a processing pipeline developed for large scale OCR that employs image pre-processing and OCR post-processing to improve the accuracy of recognition, leading to an average character accuracy of 96%. Finally, unlike the majority of previously discussed OCR systems, which require character segmentation, Katsouros et al [11] have developed a segmentation-free method that uses Hidden Markov Models (HMMs) to OCR text lines, reaching an average character accuracy of 92.39% on the GRPOLY-DB dataset [8].…”
Section: Related Work (mentioning)
confidence: 99%
“…We therefore computed character error rates (CER) by using coordinates-based alignment of words, which allowed for a fast, region-based evaluation without need for carefully aligned text and line-images. For comparison with other research on historical OCR and on the impact of OCR quality on downstream NLP tasks, we report scores according to two additional metrics: PRImA TextEval's [3] bag-of-words F1-score and normalized Levenshtein distance between ground truth and OCR output.…”
Section: Evaluation Setting (mentioning)
confidence: 99%
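
The two character-level metrics named in the statement above can be illustrated with a minimal sketch. It does not reproduce the citing paper's coordinates-based word alignment; it assumes that CER is the edit distance divided by the ground-truth length and that normalized Levenshtein distance divides by the longer string, and the function names are hypothetical. Both strings are NFC-normalized first, an added assumption so that precomposed and decomposed polytonic Greek characters compare as equal.

```python
import unicodedata


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion and substitution."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (0 if char_a == char_b else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def _nfc(text: str) -> str:
    # Assumption: compare composed forms, so eta + combining perispomeni
    # equals the single precomposed code point.
    return unicodedata.normalize("NFC", text)


def character_error_rate(ground_truth: str, ocr_output: str) -> float:
    """Assumed definition: edit distance / number of ground-truth characters."""
    truth, output = _nfc(ground_truth), _nfc(ocr_output)
    if not truth:
        raise ValueError("ground truth must not be empty")
    return levenshtein(truth, output) / len(truth)


def normalized_levenshtein(ground_truth: str, ocr_output: str) -> float:
    """Assumed definition: edit distance / length of the longer string."""
    truth, output = _nfc(ground_truth), _nfc(ocr_output)
    longest = max(len(truth), len(output))
    return 0.0 if longest == 0 else levenshtein(truth, output) / longest
```

Under these assumptions, character accuracy is simply `1 - character_error_rate(...)`, so a CER of 0.04 corresponds to 96% character accuracy.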
“…These diacritics help the reader on how the words are pronounced and emphasized. These diacritics are the acute, grave and circumflex accent, smooth and rough breathing, the subscript, the diaeresis and are shown briefly in Table 1. This particularity creates additional difficulties, mainly at the level of character segmentation and consequently in their recognition with traditional OCR systems [37][17][10]. Moreover, the text degradation such as stained paper, faded ink, connected characters as well as the scanning process that introduces skewing, low contrast, warping effects etc.…”
Section: Related Work For Word-spotting and Particularities Of Patrologia Graeca (mentioning)
confidence: 99%
“…PG has been digitized and is available on the Web by the large scale digitization project of Google [52] [46] and others [36][51] [53]. However, the access of PG for semantic navigation or simple searching is very limited.…”
Section: Introduction (mentioning)
confidence: 99%