2019 International Conference on Document Analysis and Recognition (ICDAR) 2019
DOI: 10.1109/icdar.2019.00145

Post-OCR Error Detection by Generating Plausible Candidates

Abstract: The accuracy of Optical Character Recognition (OCR) technologies considerably impacts the way digital documents are indexed, accessed and exploited. Post-processing approaches detect and correct remaining errors to improve the quality of OCR texts. However, state-of-the-art approaches still need to be improved. Most of the existing post-OCR techniques use predefined error position lists or apply simple techniques to detect errors. In this paper, we describe a novel error detector using different features from …

Cited by 12 publications (11 citation statements)
References 14 publications
“…This approach also gets higher results of correctly detected non-word errors with 95% on Periodical, 93% on Comp2019, but not on Monograph (82%). Similarly, the percentage of correctly recognized OOV words is comparable to the ones reported in the related work [12] with about 61% on average on three datasets.…”
Section: Results (supporting)
confidence: 81%
“…The rate of correctly detected real-word errors supports our assumption. Our approach is able to identify 64% of context-sensitive errors on Monograph, 63% on Periodical, 48% on Comp2019, which is better than the results reported in the prior work [12] (43% on Monograph, 49% on Periodical, no report on Comp2019). This approach also gets higher results of correctly detected non-word errors with 95% on Periodical, 93% on Comp2019, but not on Monograph (82%).…”
Section: Results (contrasting)
confidence: 72%
“…Second, analogous to previous approaches (Mei et al., 2016; Khirbat, 2017; Nguyen et al., 2019b), we enhanced the feature set with 4 additional features (referred to as the 10-feature model): (1) the actual word, (2) the actual word length, (3) context, i.e. the word preceding and following the actual word, (4) whether the word appears in the word2vec model; here we apply a simple look-up method against the pre-trained model by Hengchen et al. (2019).…”
Section: Experiments Setup (mentioning)
confidence: 80%
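
The four additional features listed in the statement above can be sketched roughly as follows. This is an illustrative sketch only, not the citing authors' code: the function name `extra_features`, the feature keys, and the use of a plain Python set as a stand-in for the word2vec vocabulary look-up are all assumptions.

```python
from typing import Dict, List, Set, Union

Feature = Union[str, int, bool]


def extra_features(tokens: List[str], i: int, vocab: Set[str]) -> Dict[str, Feature]:
    """Build the four additional features for tokens[i] (hypothetical helper)."""
    word = tokens[i]
    return {
        # (1) the actual word
        "word": word,
        # (2) the actual word length
        "length": len(word),
        # (3) context: the preceding and following word, empty at the edges
        "prev_word": tokens[i - 1] if i > 0 else "",
        "next_word": tokens[i + 1] if i + 1 < len(tokens) else "",
        # (4) simple look-up; `vocab` stands in for an embedding model's vocabulary
        "in_vocab": word.lower() in vocab,
    }


tokens = ["the", "qnick", "brown", "fox"]
vocab = {"the", "quick", "brown", "fox"}  # toy vocabulary, assumed for the example
features = extra_features(tokens, 1, vocab)
```

In this toy input the misrecognized token "qnick" is absent from the vocabulary, so the look-up feature is false, which is the kind of signal such a feature is presumably meant to give an error detector.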
“…He reported 69.6% precision, 44.2% recall and 54.1% F1. Nguyen et al. (2019b) experimented with 13 character and word features on two datasets of handwritten historical English documents (monograph and periodical) taken from the ICDAR competition (Chiron et al., 2017). The features they have experimented with include character and word n-gram frequencies, part-of-speech, and the frequency of the OCR token in its candidate generation sets which they generated using edit-distance and regression model.…”
Section: Related Work (mentioning)
confidence: 99%
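
The statement above mentions candidate sets generated using edit distance. A minimal sketch of that general idea, assuming a small in-memory lexicon and a fixed distance threshold, might look like the following. The names `levenshtein`, `candidates`, and `max_dist` are hypothetical, and the regression model mentioned in the statement is not reproduced here, so this should not be read as the paper's actual method.

```python
from typing import Iterable, List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def candidates(token: str, lexicon: Iterable[str], max_dist: int = 2) -> List[str]:
    """Return lexicon words within max_dist edits of token, closest first."""
    scored = [(levenshtein(token, w), w) for w in lexicon]
    return [w for d, w in sorted(scored) if d <= max_dist]


lexicon = ["quick", "quack", "brown", "quit"]  # toy lexicon, assumed for the example
cands = candidates("qnick", lexicon)
```

A feature such as the frequency of the OCR token within its own candidate sets could then be computed on top of such lists, although how the original work does this is not specified in the text quoted here.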