2020
DOI: 10.3897/rio.6.e55789

Towards a scientific workflow featuring Natural Language Processing for the digitisation of natural history collections

Abstract: We describe an effective approach to automated text digitisation with respect to natural history specimen labels. These labels contain much useful data about the specimen including its collector, country of origin, and collection date. Our approach to automatically extracting these data takes the form of a pipeline. Recommendations are made for the pipeline's component parts based on some of the state-of-the-art technologies. Optical Character Recognition (OCR) can be used to…

Cited by 11 publications (9 citation statements)
References 25 publications (12 reference statements)
“…Harnessing technologies developed to harvest, organise, analyse and enhance information from sources such as scholarly literature, third-party databases, data aggregators, data linkage services and geocoders and reapplying these approaches to specimens' labels and other artefacts offers the prospect of greatly accelerated data capture in a computable form [18]. Tools of particular interest span the fields of computer vision, optical character recognition, handwriting recognition, named entity recognition and language translation.…”
Section: The Specimen Data Refinery: a Canonical Workflow Framework...
Citation type: mentioning
confidence: 99%
“…While natural history collections are heterogeneous in size and shape, often they are mass digitized using standardised workflows [9,10,11,12,13]. In pursuit of higher throughput at lower cost, yet with higher accuracy and richer metadata, further automation will increasingly rely on techniques of object detection and segmentation, optical character recognition (OCR) and semantic processing of labels, and automated taxonomic identification and visual feature analysis [1,18].…”
Section: Workflows for Processing Specimen Images and Extracting Data
Citation type: mentioning
confidence: 99%
“…So while there is a strong foundation of methodologies from which to build on, species identification will still require considerable input. (Owen et al 2020).…”
Section: Condition Checking Image Trait Extraction and Species Ident
Citation type: mentioning
confidence: 99%
“…These can be prevalent in collections, and aid with data linkage and verification. Owen et al (2020) demonstrated that both geographic and person information can be accurately extracted using OCR. Person resolution was marked as amber because Bionomia (formerly Bloodhound) is currently the only tool designed specifically to match a collector with the specimens they collected.…”
Section: Atomization Validation and Classification
Citation type: mentioning
confidence: 99%
“…Here, we summarize several studies we have conducted on the digital transcription of biological specimen data from their associated specimen labels. We draw on trials we have conducted on the automated and manual transcription of specimen labels (26, 27). We have also investigated how specimen data are shared by institutions and how they store these data in their collection management systems (18).…”
Section: Introduction
Citation type: mentioning
confidence: 99%