2015
DOI: 10.7557/5.3467

Can Morphological Analyzers Improve the Quality of Optical Character Recognition?

Abstract: Optical Character Recognition (OCR) can substantially improve the usability of digitized documents. Language modeling using word lists is known to improve OCR quality for English. For morphologically rich languages, however, even large word lists do not reach high coverage on unseen text. Morphological analyzers offer a more sophisticated approach, which is useful in many language processing applications. This paper investigates language modeling in the open-source OCR engine Tesseract using morphological analy…

Citations: cited by 8 publications (8 citation statements)
References: 13 publications
“…Common OCR errors include punctuation errors, case sensitivity, character format, word meaning and segmentation error where spacings in different line, word or character lead to mis-recognitions of white-spaces [22]. OCR errors may also stem from other sources such as font variation across different materials, historical spelling variations, material quality or language specific to different media texts [1].…”
Section: OCR Errors and Topic Modeling
Citation type: mentioning
confidence: 99%
“…However, historical documents still pose a challenge for character recognition and therefore OCR of such documents still does not yield satisfying results. Some of the reasons why historical documents still pose a challenge include font variation across different materials, same words spelled differently, material quality where some documents can have deformations and unavailability of a lexicon of known historical spelling variants [1]. These factors reduce the accuracy of recognition which affects the processing of the documents and, overall, the use of digital libraries.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…This work continues earlier publications on processing individual, linguistically exceptionally challenging materials with modern methods. Jack Rueter can be regarded as a particular pioneer of the field: his earlier studies brought new knowledge at an early stage, especially on the use of morphological analyzers in text recognition [28], and in addition he has, for nearly a decade now, published and begun to open up material in various languages that is central to this kind of work (see e.g. [24,25,26]).…”
Section: Aiempi Tutkimus (Previous Research)
Citation type: unclassified
“…After the release first finite-state transducer for the closely related Komi-Zyrian (Rueter, 2000), it was only obvious that similar work should be done for Erzya Mordvin. Fortunately, over the past decade there has been an increasing number of publications on Erzya, relating to its morphology (Rueter, 2010), its OCR tools (Silfverberg and Rueter, 2015) and universal dependencies (Rueter and Tyers, 2018).…”
Section: Introduction
Citation type: mentioning
confidence: 99%