2020
DOI: 10.1007/978-3-030-45442-5_13

Assessing the Impact of OCR Errors in Information Retrieval

Abstract: A significant amount of the textual content available on the Web is stored in PDF files. These files are typically converted into plain text before they can be processed by information retrieval or text mining systems. Automatic conversion typically introduces various errors, especially if OCR is needed. In this empirical study, we simulate OCR errors and investigate the impact that misspelled words have on retrieval accuracy. In order to quantify such impact, errors were systematically inserted at varying rat…
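The abstract describes inserting errors into text at controlled rates. The following is a minimal sketch of that general idea, not the authors' actual procedure: it corrupts a chosen fraction of the words in a text by swapping one character per selected word. The confusion table `CONFUSIONS`, the function names `corrupt_word` and `simulate_ocr_errors`, and the word-level error rate are all assumptions made for illustration.

```python
import random

# Hypothetical character confusions that loosely resemble OCR mistakes.
# This table is an illustrative assumption, not data from the paper.
CONFUSIONS = {"l": "1", "o": "0", "e": "c", "m": "rn", "i": "l", "s": "5"}


def corrupt_word(word, rng):
    """Return a copy of `word` with exactly one character altered."""
    candidates = [i for i, ch in enumerate(word) if ch.lower() in CONFUSIONS]
    if candidates:
        i = rng.choice(candidates)
        return word[:i] + CONFUSIONS[word[i].lower()] + word[i + 1:]
    # No confusable character: overwrite a random position with a placeholder.
    i = rng.randrange(len(word))
    placeholder = "#" if word[i] != "#" else "~"
    return word[:i] + placeholder + word[i + 1:]


def simulate_ocr_errors(text, error_rate, seed=0):
    """Corrupt round(error_rate * n_words) words, chosen at random.

    `error_rate` is a word-level rate between 0.0 and 1.0. The seed makes
    the corruption reproducible across runs.
    """
    rng = random.Random(seed)
    words = text.split()
    n_errors = round(error_rate * len(words))
    for i in rng.sample(range(len(words)), n_errors):
        words[i] = corrupt_word(words[i], rng)
    return " ".join(words)


if __name__ == "__main__":
    sample = "automatic conversion typically introduces various errors in text"
    for rate in (0.0, 0.25, 0.5):
        print(rate, "->", simulate_ocr_errors(sample, rate, seed=42))
```

Fixing the number of corrupted words, rather than flipping a coin per word, keeps the realized error rate equal to the requested one, which makes retrieval results at different rates easier to compare.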

Citations: cited by 16 publications (9 citation statements)
References: 10 publications
“…Firstly, experienced OCR quality was the main target of the study instead of studying effects of data-oriented OCR quality, which has been repeated probably tens of times in different collections and languages (e.g. Järvelin et al., 2016; Bazzo et al., 2020). As Kumpulainen and Late (2022) show empirically, noise in OCR quality disturbs researchers of historical newspaper collections both in the searching and selection phases of their research.…”
Section: Discussion (mentioning)
confidence: 99%
“…Simulated research settings include, e.g. Taghva et al (1996), Savoy and Naji (2011) and Bazzo et al (2020), just to mention a few. The general result of these studies is that worse OCR quality lowers query results clearly.…”
Section: Related Research (mentioning)
confidence: 99%
“…Owing to large variations between collections of handwritten/historical documents, recognition models typically need to be re-trained/fine-tuned on annotated samples from the new set of documents that need to be recognized. A recent study by Bazzo et al [21] observes that even a 5% word error has a significant impact on information retrieval from document images that are automatically transcribed using an OCR. In another study based on data in a large digital library, Chiron et al [22] observe that a significant number of user queries are affected by OCR errors.…”
Section: Introduction (mentioning)
confidence: 99%
“…Owing to large variations between collections of handwritten/historical documents, recognition models typically need to be retrained/fine-tuned on annotated samples from the new set of documents that need to be recognized. A recent study by Bazzo et al [4] observe that even a 5% word error has a significant impact on information retrieval from document images that are automatically transcribed using an OCR. In another study based on data in a large digital library, Chiron et al [13] observe that a significant number of user queries are affected by OCR errors.…”
Section: Introduction (mentioning)
confidence: 99%