2020
DOI: 10.1007/978-3-030-54956-5_7
Assessing and Minimizing the Impact of OCR Quality on Named Entity Recognition

Abstract: The accessibility to digitized documents in digital libraries is greatly affected by the quality of document indexing. Among the most relevant information to index, named entities are one of the main entry points used to search and retrieve digital documents. However, most digitized documents are indexed through their OCRed version and OCR errors hinder their accessibility. This paper aims to quantitatively estimate the impact of OCR quality on the performance of named entity recognition (NER). We tested state…

Cited by 23 publications (17 citation statements)
References 27 publications
“…The training on synOCR'd WikiNER/CoNLL gives slightly worse results than NN base too. The corruption of the training data without the usage of any embeddings seems to harm performance drastically, what is in line with the observation of Hamdi et al [13]. It is striking that the training on corrupted/synOCR'd Dutch gives especially bad results for PER compared to French.…”
Section: Cross-domain Setup (supporting)
confidence: 83%
“…We test on a subset of the Dutch Europeana corpus. Hamdi et al [13] show that neural taggers perform better compared to other taggers like the Stanford NER tagger and they also prove that performance decreases drastically if the OCR error rate increases. Piktus et al [19] learn misspelling-oblivious FastText embeddings from synthetic misspellings generated by an error model for part-of-speech tagging.…”
Section: Related Work (mentioning)
confidence: 99%
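The citation statements above refer to corrupting training data with synthetic OCR noise to probe how NER degrades as the error rate rises. A minimal illustrative sketch of such a character-level noise generator is shown below; this is a hypothetical stand-in, not the actual corruption tool used by Hamdi et al. [13], and the per-character probability `cer` only approximates a target character error rate.

```python
import random


def corrupt(text: str, cer: float, seed: int = 0) -> str:
    """Illustrative synthetic OCR-noise generator (assumption, not the
    cited authors' tool): each character is independently substituted,
    deleted, or duplicated-with-insertion with probability `cer`."""
    rng = random.Random(seed)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    out = []
    for ch in text:
        if rng.random() < cer:
            op = rng.choice(("sub", "del", "ins"))
            if op == "sub":
                out.append(rng.choice(alphabet))   # replace the character
            elif op == "ins":
                out.append(ch)
                out.append(rng.choice(alphabet))   # keep it, add spurious char
            # "del": drop the character entirely
        else:
            out.append(ch)
    return "".join(out)
```

Re-training or evaluating a tagger on `corrupt(sentence, cer)` for increasing `cer` values is one simple way to reproduce the degradation curves these studies report.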
“…If sentence segmentation and dependency parsing bear the brunt of low OCR quality, NER is also affected with a significant drop of F-score between good and poor OCR (from 87% to 63% for person entities). Focusing specifically on entity processing, Hamdi et al [79,80] confronted a BiLSTM-based NER model with OCR outputs of the same text but of different qualities and observed a 30 percentage point loss in F-score when the character error rate increased from 7% to 20%. Finally, in order to assess the impact of noisy entities on NER during the CLEF-HIPE-2020 NE evaluation campaign on historical newspapers (HIPE-2020 for short), 7 Ehrmann et al [53] evaluated systems' performances on various entity noise levels, defined as the length-normalised Levenshtein distance between the OCR surface form of an entity and its manual transcription.…”
Section: Character Recognition (mentioning)
confidence: 99%
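The CLEF-HIPE-2020 noise levels mentioned above are defined as the length-normalised Levenshtein distance between an entity's OCR surface form and its manual transcription. A small sketch of that metric follows; normalising by the longer of the two strings is an assumption made here for illustration, and the campaign's exact convention may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance with unit-cost
    insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]


def entity_noise_level(ocr_form: str, gold_form: str) -> float:
    """Length-normalised Levenshtein distance between the OCR surface
    form of an entity and its manual transcription (0 = identical)."""
    if not ocr_form and not gold_form:
        return 0.0
    return levenshtein(ocr_form, gold_form) / max(len(ocr_form), len(gold_form))
```

For example, the OCRed entity "Par1s" against the gold form "Paris" has one substitution over five characters, giving a noise level of 0.2; binning entities by this value lets an evaluation report NER scores per noise level.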
“…Still, we can only consider OCR and HTR to be "research facilitators" as long as they perform within reasonable accuracy ranges. Several studies have shown that inaccuracies in OCRed documents harm information retrieval and text mining techniques like named entity recognition and linking, topic modelling, and language modelling (Alex and Burns, 2014;Chiron et al, 2017;van Strien et al, 2020;Hamdi et al, 2020;Pontes et al, 2019;Hill and Hengchen, 2019). But what do we mean by "reasonable accuracy ranges"?…”
Section: Introduction (mentioning)
confidence: 99%