Abstract: Digitized document collections often suffer from OCR errors that may impact a document's readability and retrievability. We studied the effects of correcting OCR errors on the retrievability of documents in a historic newspaper corpus of a digital library. We computed retrievability scores for the uncorrected documents using queries from the library's search log, and found that the document OCR character error rate and retrievability score are strongly correlated. We computed retrievability scores for manually…
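The two quantities the abstract relates, character error rate and retrievability score, can be illustrated with a minimal sketch. The abstract does not say which formulas were used, so this assumes the common definitions: character error rate as Levenshtein edit distance divided by ground-truth length, and a cumulative retrievability score that counts the queries for which a document appears in the top `cutoff` results. The function names, the `rankings` structure, and the `cutoff` default are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def char_error_rate(ocr_text: str, truth: str) -> float:
    """Levenshtein edit distance between OCR output and ground truth,
    divided by the ground-truth length (assumed definition)."""
    if not truth:
        raise ValueError("ground truth must be non-empty")
    # prev[j] holds the distance between the OCR prefix seen so far
    # and the first j characters of the ground truth.
    prev = list(range(len(truth) + 1))
    for i, ocr_char in enumerate(ocr_text, start=1):
        curr = [i]
        for j, truth_char in enumerate(truth, start=1):
            cost = 0 if ocr_char == truth_char else 1
            curr.append(min(prev[j] + 1,          # extra OCR character
                            curr[j - 1] + 1,      # missing character
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(truth)


def retrievability(rankings: dict, cutoff: int = 10) -> dict:
    """Cumulative retrievability (assumed measure): for each document,
    the number of queries that rank it within the top `cutoff` results.

    `rankings` maps a query string to its ranked list of document ids
    (hypothetical input format).
    """
    scores = defaultdict(int)
    for ranked_docs in rankings.values():
        for doc_id in ranked_docs[:cutoff]:
            scores[doc_id] += 1
    return dict(scores)
```

Under these assumptions, a document whose OCR text garbles a query term drops out of the ranked lists for queries containing that term, which lowers its score; this is one way the reported correlation could arise.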
“…Tokens with higher word-level confidence (which is positively influenced by successful matching to lists with known Dutch words) have higher retrievability scores [28]. Comparing OCR character error rates on newspapers from the 17th century vs. newspapers from the Second World War, the error rate is clearly higher for the 17th-century collection, negatively affecting retrievability of older articles [27].…”
Section: How Much Music Is In the Corpus? (mentioning)
confidence: 99%
“…Formal, large-scale OCR quality assessment is an emerging topic of interest at the KB. However, only small annotated ground truth datasets are available so far [27]. As discussed in [27][28][29], OCR quality impacts document retrievability.…”
Section: How Much Music Is In the Corpus? (mentioning)
confidence: 99%
“…However, only small annotated ground truth datasets are available so far [27]. As discussed in [27][28][29], OCR quality impacts document retrievability. Tokens with higher word-level confidence (which is positively influenced by successful matching to lists with known Dutch words) have higher retrievability scores [28].…”
Section: How Much Music Is In the Corpus? (mentioning)
confidence: 99%
“…Named entities are frequently present in the query logs of the Delpher portal [27]. Generally, knowledge of named entities is beneficial for linked and enriched data access.…”
Section: Semantic Challenges (mentioning)
confidence: 99%
“…As a means to scale up quality improvements at the OCR and NER linking levels, the KB currently investigates crowdsourcing possibilities [27,31]. This also will be a useful mechanism when seeking to engage music domain experts.…”