Proceedings of the 18th ACM/IEEE on Joint Conference on Digital Libraries 2018
DOI: 10.1145/3197026.3197046

Impact of Crowdsourcing OCR Improvements on Retrievability Bias

Abstract: Digitized document collections often suffer from OCR errors that may impact a document's readability and retrievability. We studied the effects of correcting OCR errors on the retrievability of documents in a historic newspaper corpus of a digital library. We computed retrievability scores for the uncorrected documents using queries from the library's search log, and found that the document OCR character error rate and retrievability score are strongly correlated. We computed retrievability scores for manually…

Cited by 8 publications (14 citation statements)
References 14 publications

“…Tokens with higher word-level confidence (which is positively influenced by successful matching to lists with known Dutch words) have higher retrievability scores [28]. Comparing OCR character error rates on newspapers from the 17th century vs. newspapers from the Second World War, the error rate is clearly higher for the 17th-century collection, negatively affecting retrievability of older articles [27].…”
Section: How Much Music Is In the Corpus? (mentioning)
confidence: 99%
“…Formal, large-scale OCR quality assessment is an emerging topic of interest at the KB. However, only small annotated ground truth datasets are available so far [27]. As discussed in [27][28][29], OCR quality impacts document retrievability.…”
Section: How Much Music Is In the Corpus? (mentioning)
confidence: 99%