Proceedings of the 9th ACM/IEEE-CS Joint Conference on Digital Libraries 2009
DOI: 10.1145/1555400.1555437

Improving optical character recognition through efficient multiple system alignment

citations
Cited by 19 publications
(11 citation statements)
references
References 13 publications
“…The texts from the three OCR engines are character aligned using the A* algorithm with the Reverse Dijkstra admissible heuristic described by Lund and Ringger [8]. From this character level alignment we construct a lattice of word hypotheses such that wherever there is agreement across all engines on the location of white space we construct a column of hypotheses.…”
Section: A Baseline OCR Results (mentioning)
confidence: 99%
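The lattice step described in the statement above can be illustrated with a minimal sketch. It assumes the character alignment has already been computed and is given as equal-length strings, one per OCR engine, with a hypothetical gap symbol `-` marking insertions; the function name `word_columns` is illustrative only and is not taken from the cited papers. The A* alignment itself is not shown.

```python
GAP = "-"  # hypothetical gap symbol assumed to be emitted by the aligner


def word_columns(aligned):
    """Split character-aligned OCR outputs into columns of word hypotheses.

    A column boundary is placed wherever every engine has a space at the
    same aligned position. Gap symbols are removed from each hypothesis.
    """
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned strings must have equal length")

    columns = []
    start = 0
    for i in range(length + 1):
        at_end = i == length
        # Close the current column at the end of the text or where all
        # engines agree on white space.
        if at_end or all(s[i] == " " for s in aligned):
            if i > start:
                columns.append([s[start:i].replace(GAP, "") for s in aligned])
            start = i + 1
    return columns


aligned = [
    "the c-at sat",
    "the c at sat",
    "tha cat- sat",
]
columns = word_columns(aligned)
```

Note that the second engine's space inside "c at" does not start a new column, because the other two engines do not have white space at that position.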
“…Our research leverages the variation among OCR engines (see Figure 1) and additional features of the OCR hypotheses to improve the output beyond what any single OCR engine is capable of. In this case, where in-domain training data is available, we improve upon our previous work [8] and show how a decision list trained on in-domain data using feature combinations reduces the word error rate beyond what is achieved using consensus voting or dictionary matching alone. Further, we explore using a spell checker to suggest additional words for hypotheses that do not appear in the dictionary or gazetteers.…”
Section: Introduction (mentioning)
confidence: 84%
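The two baselines named in the statement above, consensus voting and dictionary matching, can be sketched as follows. This is an illustrative assumption about how such baselines might be combined for a single column of word hypotheses; it is not the trained decision list from the citing paper, and `select_word` and its fallback order are hypothetical.

```python
from collections import Counter


def select_word(hypotheses, dictionary):
    """Pick one word from a column of OCR hypotheses.

    1. Consensus voting: return the hypothesis that a strict majority
       of engines agree on.
    2. Dictionary matching: otherwise return the first hypothesis found
       in the dictionary (case-insensitive).
    3. Fall back to the first engine's hypothesis.
    """
    word, votes = Counter(hypotheses).most_common(1)[0]
    if votes > len(hypotheses) / 2:
        return word
    for hypothesis in hypotheses:
        if hypothesis.lower() in dictionary:
            return hypothesis
    return hypotheses[0]


dictionary = {"the", "cat", "sat"}
majority = select_word(["cat", "c at", "cat"], dictionary)
by_dictionary = select_word(["tbe", "the", "thc"], dictionary)
fallback = select_word(["xq", "xp", "xr"], dictionary)
```

A learned decision list, as described in the statement, would replace this fixed ordering with rules ranked from in-domain training data.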
“…Ringlstetter et al. [27] suggested a method to discriminate character confusions in multilingual texts. Cecotti et al. [6] and Lund and Ringger [16] aligned multiple OCR outputs and illustrated strategies for selection. Namboodiri et al. [20] and Zhuang and Zhu [32] integrated multi-knowledge with the OCR output in post-processing, such as fixed poetical structures for Indian poetry or semantic lexicons for Chinese texts.…”
Section: Related Work (mentioning)
confidence: 99%
“…Similarly, Optical Character Recognition (OCR) engines, despite great success in recent attempts such as Google Books or Internet Archive, are not without problems, and often produce error-abundant text data. [13] pointed out that although researchers are having increasing levels of success in digitizing hand-written manuscripts, Table 1: Top words (selected by LDA) for five topics of a small sample of Unlv OCR data set (erroneous words are in italic).…”
Section: Introduction (mentioning)
confidence: 99%