2011 International Conference on Document Analysis and Recognition 2011
DOI: 10.1109/icdar.2011.138

Error Correction with In-domain Training across Multiple OCR System Outputs

Abstract: Optical character recognition (OCR) systems differ in the types of errors they make, particularly in recognizing characters from degraded or poor quality documents. The problem is how to correct these OCR errors, which is the first step toward more effective use of the documents in digital libraries. This paper demonstrates the degree to which the word error rate (WER) can be reduced using a decision list on a combination of textual features across the aligned output of multiple OCR engines where in-domain tra…
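
For readers unfamiliar with the metric, the following is a minimal sketch of one common way to compute WER: the word-level edit distance between a reference transcript and an OCR hypothesis, divided by the number of reference tokens. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the paper's exact scoring procedure is not given in this excerpt.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is plain whitespace splitting, which is
    an assumption and not necessarily what the paper uses.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")

    # prev[j] holds the edit distance between the reference tokens
    # processed so far and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # reference token missing from hypothesis
                curr[j - 1] + 1,     # extra token in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word and one dropped word out of four reference tokens.
    print(word_error_rate("the quick brown fox", "the qu1ck brown"))
```

Because the denominator is the reference length, a hypothesis with many spurious tokens can score above 1.0 under this definition.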

Cited by 8 publications (5 citation statements)
References 13 publications
“…Previous work [2, 13, 26] using the Eisenhower Communiqués test set and the Enron training set was focused on individual document performance in which corpus WERs were calculated as the average of the individual document WERs. A result of this method is that corpus statistics would give more weight to the tokens of a short document than to the tokens of a long document.…”
Section: Results (mentioning)
confidence: 99%
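
The weighting effect described in this statement can be illustrated with a small sketch that compares averaging per-document WERs (macro-averaging) against pooling errors and tokens over the whole corpus (micro-averaging). The function names `macro_wer` and `micro_wer` and the per-document counts are hypothetical and chosen only to make the difference visible; they are not taken from either paper.

```python
from typing import List, Tuple

# Each document is summarized as (word_errors, reference_tokens).
DocCounts = Tuple[int, int]


def macro_wer(docs: List[DocCounts]) -> float:
    """Average of per-document WERs: every document counts equally."""
    return sum(errors / tokens for errors, tokens in docs) / len(docs)


def micro_wer(docs: List[DocCounts]) -> float:
    """Pooled WER: every token counts equally, regardless of document."""
    total_errors = sum(errors for errors, _ in docs)
    total_tokens = sum(tokens for _, tokens in docs)
    return total_errors / total_tokens


if __name__ == "__main__":
    # Hypothetical corpus: a short, noisy document and a long, clean one.
    docs = [(5, 10), (10, 1000)]
    print(f"macro-averaged WER: {macro_wer(docs):.4f}")
    print(f"micro-averaged WER: {micro_wer(docs):.4f}")
```

With these made-up counts the short document dominates the macro average, while the micro average stays close to the error rate of the long document that holds most of the tokens.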
“…Our previous work [13, 26] showed the improvement in the WER using a trained machine learner with the alignment of the output from multiple OCR engines. Reinterpreting these previous results using micro-averaging techniques, the underlined entries in Table 4 show the decreasing WER as additional OCR outputs are added to the alignment.…”
Section: Baseline Results (mentioning)
confidence: 99%
“…[19] Despite this warning, since the early 2010s the text recognition community has said that "the OCR of modern clean documents is effectively a solved problem" although "older degraded documents present difficulties." [20] HTR is mentioned only in passing in Digital Scholarly Editing: Theories and Practices as research that "has not yet produced reliably working products, but much more is to be expected in the coming years." [21] However, there was an implication that digital editions that depend on any such automated processes are not "proper," [22] which may speak to issues of scholarly ownership of the entire editing process, given that the mechanisms of many OCR and HTR technologies remain black-box.…”
Section: HTR and Scholarly Editing (mentioning)
confidence: 99%
“…A degradation model was proposed and analyzed by [13] to recognize the similarity between groups of ruined characters. The authors of [14] proposed a method of reducing word error rate using different OCR engines and the usage of in-domain training was explored. An iterative training framework was proposed by [15] which uses OCR without segmentation to reduce the error in character recognition.…”
Section: Optical Character Recognition (OCR) (mentioning)
confidence: 99%