2022
DOI: 10.1017/s1351324922000110

In-depth analysis of the impact of OCR errors on named entity recognition and linking

Abstract: Named entities (NEs) are among the most relevant types of information that can be used to properly index digital documents and thus easily retrieve them. It has long been observed that NEs are key to accessing the contents of digital library portals as they are contained in most user queries. However, most digitized documents are indexed through their optical character recognition (OCRed) versions, which include numerous errors. Although OCR engines have considerably improved over the last few years, OCR errors s…

Cited by 17 publications (5 citation statements)
References 69 publications
“…Text detection is a crucial component of Optical Character Recognition (OCR) technology [1,2,3]. Traditional text detection methods employ techniques such as binarization and affine transformations to locate text strings.…”
Section: Related Work (mentioning)
confidence: 99%
“…Given this dearth of historical synthetic datasets, there have been several efforts to artificially "age" these newer documents by including effects such as artificial warping, rotation, and simulated dust and random noise on the page. While much focus has been placed on the aging of articles on downstream tasks such as the mining of historical event-related OCR text [24] or named-entity recognition [25], some recent work has focused on the effects of the aging process on the localization of page-objects [10,11] and the generation of new training sets for historical documents [26].…”
Section: Minimal Historical Synthetic Data (mentioning)
confidence: 99%
“…Van Strien et al [16], for instance, showed that processing low-quality documents impairs the performance of six NLP tasks including sentence segmentation and dependency parsing. Hamdi et al [17] showed that the performance of named entity recognition systems can have a significant drop of F1-score from 90% to 50% for character error rates between 2% and 30%.…”
Section: Related Work (mentioning)
confidence: 99%