2019 ACM/IEEE Joint Conference on Digital Libraries (JCDL) 2019
DOI: 10.1109/jcdl.2019.00015

Deep Statistical Analysis of OCR Errors for Effective Post-OCR Processing

Abstract: Post-OCR is an important processing step that follows optical character recognition (OCR) and is meant to improve the quality of OCR documents by detecting and correcting residual errors. This paper describes the results of a statistical analysis of OCR errors on four document collections. Five aspects related to general OCR errors are studied and compared with human-generated misspellings, including edit operations, length effects, erroneous character positions, real-word vs. non-word errors, and word boundar…

Citations: cited by 45 publications (30 citation statements)
References: 23 publications (41 reference statements)
“…Particularly for historical texts and despite notable improvements over time (Smith and Cordell, 2018), error rates can be very high, with largely unknown biasing consequences for end users (Alex et al, 2012; Milligan, 2013; Strange et al, 2014; Cordell, 2017; Jarlbrink and Snickars, 2017; Traub et al, 2018; Cordell, 2019). Consequently, assessing and improving OCR quality has been, and still is, a key area for research and development (Alex and Burns, 2014; Ehrmann et al, 2016; Smith and Cordell, 2018; Nguyen et al, 2019; Hakala et al, 2019).…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…This value fell within the range of 30%–40% reported [14]. The authors in [16] were of the opinion that the average rate of occurrence of non-word errors could be put at 67.5%. Our result on the ratio of real-word to non-word errors therefore agrees with general trends found in other languages.…”
Section: Phonetic Real and Character Errors
Citation type: mentioning
confidence: 62%
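
To make the real-word vs. non-word distinction in the statement above concrete, here is a minimal Python sketch. It is only an illustration, not the method of the cited or citing papers: the function names, the toy `lexicon`, and the sample token pairs are all hypothetical. An erroneous token counts as a real-word error if it still appears in the lexicon, and as a non-word error otherwise.

```python
def classify_error(ocr_token: str, truth_token: str, lexicon: set) -> str:
    """Label one aligned (OCR, ground truth) token pair.

    Assumption: tokens are already aligned one-to-one, and the lexicon
    holds lowercase word forms.
    """
    if ocr_token == truth_token:
        return "correct"
    # The OCR output is wrong; check whether it is still a valid word.
    if ocr_token.lower() in lexicon:
        return "real-word"
    return "non-word"


def non_word_rate(pairs, lexicon: set) -> float:
    """Share of erroneous tokens that are non-word errors."""
    labels = [classify_error(ocr, truth, lexicon) for ocr, truth in pairs]
    errors = [label for label in labels if label != "correct"]
    if not errors:
        return 0.0
    return errors.count("non-word") / len(errors)


# Hypothetical toy data, for illustration only.
lexicon = {"the", "form", "from", "modern"}
pairs = [("tbe", "the"), ("form", "from"), ("the", "the"), ("rnodern", "modern")]
rate = non_word_rate(pairs, lexicon)
```

In practice the measured rate depends heavily on the lexicon's coverage, which is one reason such figures can differ across collections and languages.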
“…According to previous work [13], more than 80% of OCRed errors have an edit distance less than 3. We apply this feature to remove some irrelevant candidates.…”
Section: Error Correction
Citation type: mentioning
confidence: 83%
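
The candidate-pruning idea in the statement above can be sketched in Python as follows. This is a hedged illustration rather than the citing authors' implementation: it assumes plain Levenshtein distance with unit costs, and the names `edit_distance`, `filter_candidates`, and `max_distance` are hypothetical. "Edit distance less than 3" is read here as a threshold of at most 2.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def filter_candidates(ocr_token: str, candidates, max_distance: int = 2):
    """Keep only correction candidates within max_distance of the OCR token."""
    return [c for c in candidates if edit_distance(ocr_token, c) <= max_distance]


# Hypothetical example: prune candidates for the OCR token "tbe".
kept = filter_candidates("tbe", ["the", "tube", "table", "there"])
```

Here `kept` retains "the", "tube", and "table" and drops "there", which is three edits away. The threshold trades recall for speed: a tighter bound discards more irrelevant candidates but also any true correction that lies beyond it.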