This exploratory study proposes a prototype sentence-level parallel corpus to support the study of optical character recognition (OCR) quality in curated digitized library collections. Existing data resources, such as ICDAR2019 [21] and GT4HistOCR [23], generally align content by publishing units of the artifact, such as documents or lines, which limits their use for exploring OCR noise at natural-language granularities such as sentences and chapters. Building upon an existing volume-aligned corpus that collected human-proofread texts from Project Gutenberg and paired OCR views from the HathiTrust Digital Library, we extracted and aligned 167,079 sentences from 189 sampled books in four domains published from 1793 to 1984. To support downstream research on OCR quality, we analyzed OCR errors with a specific focus on their associations with source text metadata. We found that sampled data in agriculture has a higher ratio of real-word errors than other domains, while sentences from social-science volumes contain more non-word errors. In addition, data sampled from earlier volumes tends to have a high ratio of non-word errors, while samples from more recently published volumes are likely to have more real-word errors. Based on these findings, we suggest that scholars consider the potential influence of source data characteristics on their results when studying OCR quality issues.
CCS CONCEPTS • Information systems → Digital libraries and archives; • Applied computing → Document management and text processing; Document capture.
Linked open data technologies seem likely to offer significant benefits for digital library users. So far, however, these benefits remain largely speculative despite growing momentum toward the application of linked data in libraries. This paper reports on an exploratory user study to evaluate the use of linked data with digital library special collections. This pilot study assessed features designed to contextualize search results and representations of primary sources, and features that leverage linked data to offer analytic insights. Digital library users were most excited by applications of linked data that provide new, analytic information and that simultaneously open up new avenues for research within and beyond individual collections.