Large collections of historical biodiversity expeditions are housed in natural history museums throughout the world. Potentially they can serve as rich sources of data for cultural historical and biodiversity research. However, they exist as only partially catalogued specimen repositories and images of unstructured, non-standardised, handwritten text and drawings. Although many archival collections have been digitised, disclosing their content is challenging. They refer to historical place names and outdated taxonomic classifications and are written in multiple languages. Efforts to transcribe the handwritten text can make the content accessible, but semantically describing and interlinking the content would further facilitate research. We propose a semantic model that serves to structure the named entities in natural history archival collections. In addition, we present an approach for the semantic annotation of these collections whilst documenting their provenance. This approach serves as an initial step for an adaptive learning approach for semi-automated extraction of named entities from natural history archival collections. The applicability of the semantic model and the annotation approach is demonstrated using image scans from a collection of 8,000 field book pages gathered by the Committee for Natural History of the Netherlands Indies between 1820 and 1850, and evaluated together with domain experts from the field of natural and cultural history.
Large and important parts of cultural heritage are stored in archives that are difficult to access, even after digitization. Documents and notes are written in hard-to-read historical handwriting and are often interspersed with illustrations. Such collections are weakly structured and largely inaccessible to a wider public and scholars. Traditionally, humanities researchers treat text and images separately. This separation extends to traditional handwriting recognition systems. Many of them use a segmentation free OCR approach which only allows the resolution of homogenous manuscripts in terms of layout, style and linguistic content. This is in contrast to our infrastructure which aims to resolve heterogeneous handwritten manuscript pages in which different scripts and images are narrowly intertwined. Authors in our use case, a 17,000 page account of exploration of the Indonesian Archipelago between 1820-1850 ("Natuurkundige Commissie voor Nederlands-Indië") tried to follow a semantic way to record their knowledge and observations, however, this discipline does not exist in the handwriting script. The use of different languages, such as German, Latin, Dutch, Malay, Greek, and French makes interpretation more challenging. Our infrastructure takes the state-of-the-art word retrieval system MONK as starting point. Owing to its visual approach, MONK can handle the diversity of material we encounter in our use case and many other historical collections: text, drawings and images. By combining text and image recognition, we significantly transcend beyond the state-of-the art, and provide meaningful additions to integrated manuscript recognition. This paper describes the infrastructure and presents early results.
In this paper, scientific species names from images of handwritten species observations are automatically recognised and annotated with semantic concepts, so that they can be used for document retrieval and faceted search. Until now, automated semantic annotation of such named entities was only applied to printed or digital text. We employ a two-step approach. First, word images are classified, identifying elements of scientific species names; Genus, species, author, using (i) visual structural features, (ii) position, and (iii) context. Second, the identified species names are semantically annotated according to the NHC-Ontology, an ontology that describes species observations. Internationalised Resource Identifiers (IRIs) are assigned to the elements so that they can be linked and disambiguated at a later stage by individual researchers. For the identification of scientific species names, we achieve an average F1 score of 0.86. Moreover, we discuss how our method will function in a semi-automated annotation process, with a fruitful dialogue between system and user as the main objective.
Museums, archives and digital libraries make increasing use of Semantic Web technologies to enrich and publish their collection items. The contents of those items, however, are not often enriched in the same way. Extracting named entities within historical manuscripts and disclosing the relationships between them would facilitate cultural heritage research, but it is a labour-intensive and time-consuming process, particularly for handwritten documents.It requires either automated handwriting recognition techniques, or manual annotation by domain experts before the content can be semantically structured. Different workflows have been proposed to address this problem, involving full-text transcription and named entity extraction, with results ranging from unstructured files to semantically annotated knowledge bases. Here, we detail these workflows and describe the approach we have taken to disclose historical biodiversity data, which enables the direct labelling and semantic annotation of document images in hand-written archives.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.