2016
DOI: 10.5120/ijca2016910142
A Survey on Various OCR Errors

Abstract: Research on correcting words in OCR text mainly concerns (1) non-word errors, (2) isolated-word error correction, and (3) context-dependent word correction. Various techniques have been developed for each. This paper surveys these error-correction techniques and assesses which perform better.


Cited by 6 publications (5 citation statements)
References 11 publications
“…Repeating the whole pipeline using a paid OCR resource like Abbyy FineReader, which our library already pays to access, could yield more accurate raw text. If further OCR correction seems necessary, there are various NLP and statistical methods that have been researched for correcting OCR errors computationally (Amrhein & Clematide, 2018;Kumar, 2016). Excluding Latin text and sentence fragments is another non-trivial step.…”
Section: Future Improvements (mentioning)
confidence: 99%
“…For example, tokenization errors are known to be frequent in the ACL anthology corpus (Nastase and Hitschler, 2018) and in digitized newspapers (Soni et al, 2019; Adesam et al, 2019). There is a large body of research on OCR error correction (Kumar, 2016). However, not all methods can deal with tokenization errors, and it is stated in Hämäläinen and Hengchen (2019) that: "A limitation of our approach is that it cannot do word segmentation in case multiple words have been merged together as a result of the OCR process.…”
Section: Sources Of Tokenization Errors (mentioning)
confidence: 99%
“…OCR quality. OCR is a challenge for NLP tasks across research areas [9,10], including for scanned biodiversity texts, which often include handwritten field notes [8,11,12]. Particularly for older documents, the text will likely require additional human effort to clean errors and/or transcribe the documents prior to processing.…”
Section: Preliminary Findings (mentioning)
confidence: 99%