Natural history collections are leading successful large-scale projects of specimen digitization (images, metadata, DNA barcodes), thereby transforming taxonomy into a big data science. Yet, little effort has been directed towards safeguarding and subsequently mobilizing the considerable amount of original data generated during the process of naming 15,000–20,000 species every year. From the perspective of alpha-taxonomists, we provide a review of the properties and diversity of taxonomic data, assess their volume and use, and establish criteria for optimizing data repositories. We surveyed 4113 alpha-taxonomic studies in representative journals for 2002, 2010, and 2018, and found an increasing yet comparatively limited use of molecular data in species diagnosis and description. In 2018, of the 2661 papers published in specialized taxonomic journals, molecular data were widely used in mycology (94%), regularly in vertebrates (53%), but rarely in botany (15%) and entomology (10%). Images play an important role in taxonomic research on all taxa, with photographs used in >80% and drawings in 58% of the surveyed papers. The use of omics (high-throughput) approaches or 3D documentation is still rare. Improved archiving strategies for metabarcoding consensus reads, genome and transcriptome assemblies, and chemical and metabolomic data could help to mobilize the wealth of high-throughput data for alpha-taxonomy. Because long-term—ideally perpetual—data storage is of particular importance for taxonomy, energy footprint reduction via less storage-demanding formats is a priority if their information content suffices for the purpose of taxonomic studies. Whereas taxonomic assignments are quasifacts for most biological disciplines, they remain hypotheses pertaining to evolutionary relatedness of individuals for alpha-taxonomy. 
For this reason, an improved reuse of taxonomic data, including machine-learning-based species identification and delimitation pipelines, requires a cyberspecimen approach—linking data via unique specimen identifiers, and thereby making them findable, accessible, interoperable, and reusable for taxonomic research. This poses both qualitative challenges to adapt the existing infrastructure of data centers to a specimen-centered concept and quantitative challenges to host and connect an estimated ≤2 million images produced per year by alpha-taxonomic studies, plus many millions of images from digitization campaigns. Of the 30,000–40,000 taxonomists globally, many are thought to be nonprofessionals, and capturing the data for online storage and reuse therefore requires low-complexity submission workflows and cost-free repository use. Expert taxonomists are the main stakeholders able to identify and formalize the needs of the discipline; their expertise is needed to implement the envisioned virtual collections of cyberspecimens. [Big data; cyberspecimen; new species; omics; repositories; specimen identifier; taxonomy; taxonomic data.]
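The cyberspecimen approach outlined above can be sketched as a minimal data structure in which heterogeneous records (images, sequences, audio) are all linked to one persistent specimen identifier. The class names, identifier scheme, and repository names below are illustrative assumptions, not part of any existing infrastructure:

```python
from dataclasses import dataclass, field

@dataclass
class DataRecord:
    kind: str        # e.g. "image", "dna-sequence", "audio"
    repository: str  # where the record is stored
    accession: str   # record identifier within that repository

@dataclass
class Cyberspecimen:
    specimen_id: str  # globally unique, persistent specimen identifier
    records: list = field(default_factory=list)

    def link(self, record: DataRecord) -> None:
        # Attach a data record to this specimen's identifier.
        self.records.append(record)

    def find(self, kind: str) -> list:
        # Retrieve all linked records of a given data type.
        return [r for r in self.records if r.kind == kind]

# Usage: link an image and a DNA barcode to one specimen identifier
# (the identifier and repository names are hypothetical).
spec = Cyberspecimen("ZSM:Herp:12345")
spec.link(DataRecord("image", "example-media-repo", "IMG-001"))
spec.link(DataRecord("dna-sequence", "GenBank", "MK000001"))
print(len(spec.find("image")))  # 1
```

The design point is that the specimen identifier, not any single data file, is the unit of linkage; every downstream record remains findable through that one key.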
Repeatability of study setups and reproducibility of research results from the underlying data are major requirements in science. Until now, abstract models for describing the structural logic of studies in the environmental sciences have been lacking, and tools for data management are insufficient. Repeatability and reproducibility require sophisticated data management solutions that go beyond data file sharing; in particular, they imply the maintenance of coherent data along workflows. Design data concern elements from the elementary domains of operations: transformation, measurement, and transaction. Operation design elements and method information are specified for each consecutive workflow segment, from field to laboratory campaigns. The strict linkage of operation design element values, operation values, and objects is essential. To enable coherence of corresponding objects along consecutive workflow segments, unique identifiers must be assigned and their relations specified. The abstract model presented here addresses these aspects, and the software DiversityDescriptions (DWB-DD) facilitates the management of such connected digital data objects and structures. DWB-DD allows for an individual specification of operation design elements and their linking to objects. Two workflow design use cases are given, one for DNA barcoding and another for the cultivation of fungal isolates. To publish such structured data, standard schema mapping and XML provision of digital objects are essential. Schemas useful for this mapping include the Ecological Metadata Language, the Schema for Meta-omics Data of Collection Objects, and the Standard for Structured Descriptive Data. Data pipelines with DWB-DD include the mapping and conversion between schemas as well as functions for data publishing and archiving according to the Open Archival Information System standard.
The setting allows for repeatability of study setups, reproducibility of study results and for supporting work groups to structure and maintain their data from the beginning of a study. The theory of ‘FAIR++’ digital objects is introduced.
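The chaining of workflow-segment records via unique identifiers, as required by the model above, can be illustrated with a minimal sketch. The record fields and operation names are our own illustrative assumptions, not the actual DWB-DD data model:

```python
import uuid

def new_record(operation: str, parent_id=None, **design_elements):
    # Each workflow segment yields a record with its own unique identifier
    # plus a reference to the preceding object's identifier, keeping
    # corresponding objects coherent across consecutive segments.
    return {
        "id": str(uuid.uuid4()),    # unique object identifier
        "operation": operation,     # transformation / measurement / transaction
        "parent": parent_id,        # link to the preceding workflow segment
        "design": design_elements,  # operation design element values
    }

def trace(record, index):
    # Walk the parent links back to the original field sample.
    chain = [record["operation"]]
    while record["parent"] is not None:
        record = index[record["parent"]]
        chain.append(record["operation"])
    return list(reversed(chain))

# A minimal DNA-barcoding workflow: sampling -> extraction -> sequencing
# (design element values are hypothetical).
sample = new_record("field_sampling", site="plot 7")
extract = new_record("dna_extraction", sample["id"], kit="example-kit")
seq = new_record("sequencing", extract["id"], marker="COI")

index = {r["id"]: r for r in (sample, extract, seq)}
print(trace(seq, index))  # ['field_sampling', 'dna_extraction', 'sequencing']
```

Because every record carries both its own identifier and its predecessor's, the full provenance of a sequence can be reconstructed from the final record alone, which is what makes the study setup repeatable.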
Plants, fungi and algae are important components of global biodiversity and are fundamental to all ecosystems. They are the basis for human well-being, providing food, materials and medicines. Specimens of all three groups of organisms are accommodated in herbaria, where they are commonly referred to as botanical specimens. The large number of specimens in herbaria provides an ample, permanent and continuously improving knowledge base on these organisms and an indispensable source for the analysis of the distribution of species in space and time, which is critical for current and future research relating to global biodiversity. In order to make full use of this resource, a research infrastructure has to be built that grants comprehensive and free access to the information in herbaria and botanical collections in general. This can be achieved through digitization of the botanical objects and associated data. The botanical research community can count on a long-standing tradition of collaboration among institutions and individuals. It agreed on data standards and standard services even before the advent of computerization and information networking, an example being the Index Herbariorum as a global registry of herbaria helping towards the unique identification of specimens cited in the literature. In the spirit of this collaborative history, 51 representatives from 30 institutions advocate starting the digitization of botanical collections with the wall-to-wall digitization of the flat objects stored in German herbaria. Germany has 70 herbaria holding almost 23 million specimens according to a national survey carried out in 2019. Of these specimens, 87% are not yet digitized. Experiences from other countries such as France, the Netherlands, Finland, the US and Australia show that herbaria can be comprehensively and cost-efficiently digitized in a relatively short time, owing to established workflows and protocols for the high-throughput digitization of flat objects.
Most of the herbaria are part of a university (34), fewer belong to municipal museums (10) or state museums (8), six herbaria belong to institutions also supported by federal funds such as Leibniz institutes, and four belong to non-governmental organizations. A common data infrastructure must therefore integrate different kinds of institutions. Making full use of the data gained by digitization requires the setup of a digital infrastructure for storage, archiving, content indexing and networking as well as standardized access for the scientific use of digital objects. A standards-based portfolio of technical components has already been developed and successfully tested by the biodiversity informatics community over the last two decades, comprising, among others, access protocols, collection databases, portals, tools for semantic enrichment and annotation, international networking, storage and archiving in accordance with international standards. This was achieved through funding from national and international programs and initiatives, which also paved the road for the German contribution to the Global Biodiversity Information Facility (GBIF). Herbaria constitute a large part of the German botanical collections, which also comprise living collections in botanical gardens and seed banks, DNA and tissue samples, specimens preserved in fluids or on microscope slides, and more. Once the herbaria are digitized, these resources can be integrated, adding to the value of the overall research infrastructure. The community has agreed on tasks that are shared between the herbaria, as the German GBIF model already successfully demonstrates. We have compiled nine scientific use cases of immediate societal relevance for an integrated infrastructure of botanical collections.
They address accelerated biodiversity discovery and research, biomonitoring and conservation planning, biodiversity modelling, the generation of trait information, automated image recognition by artificial intelligence, automated pathogen detection, contextualization by interlinking objects, enabling provenance research, as well as education, outreach and citizen science. We propose to start this initiative now in order to valorize German botanical collections as a vital part of a worldwide biodiversity data pool.
The ability to rapidly generate and share molecular, visual, and acoustic data, and to compare them with existing information, and thereby to detect and name biological entities is fundamentally changing our understanding of evolutionary relationships among organisms and is also impacting taxonomy. Harnessing taxonomic data for rapid, automated species identification by machine learning tools or DNA metabarcoding techniques has great potential but will require their review, accessible storage, comprehensive comparison, and integration with prior knowledge and information. Currently, data production, management, and sharing in taxonomic studies are not keeping pace with these needs. Indeed, a survey of recent taxonomic publications provides evidence that few species descriptions in zoology and botany incorporate DNA sequence data. The use of modern high-throughput (-omics) data is so far the exception in alpha-taxonomy, although such data are easily stored in GenBank and similar databases. By contrast, for the more routinely used image data, the problem is that they are rarely made available in openly accessible repositories. Improved sharing and reusing of both types of data requires institutions that maintain long-term data storage and capacity with workable, user-friendly but highly automated pipelines. Top priority should be given to standardization and pipeline development for the easy submission and storage of machine-readable data (e.g., images, audio files, videos, tables of measurements). The taxonomic community in Germany and the German Federation for Biological Data are researching options for a higher level of automation, improved linking among data submission and storage platforms, and for making existing taxonomic information more readily accessible.
Nucleic acid and protein sequencing-based analyses are increasingly applied to determine the origin, identity, and traits of environmental (biological) objects and organisms. In this context, the need for corresponding data structures has become evident. As existing schemas and community standards in the domains of biodiversity and molecular biological research are comparatively limited with regard to the number of generic and specific elements, previous schemas for describing the physical and digital objects need to be replaced or expanded by new elements to cover the requirements of meta-omics techniques and operational details. On the one hand, schemas and standards have hitherto mostly focused on elements, descriptors, or concepts that are relevant for data exchange and publication; on the other hand, detailed operational aspects regarding origin context and laboratory processing, as well as data management details such as the documentation of physical and digital object identifiers, are rather neglected. The conceptual schema for Meta-omics Data and Collection Objects (MOD-CO; https://www.mod-co.net/) has been set up recently (Rambold et al. 2019). It includes design elements (descriptors or concepts) describing structural and operational details along the work- and data flow, from gathering environmental samples through the various transformation, transaction, and measurement steps in the laboratory up to sample and data publication and archiving. The concepts are named according to a multipartite naming structure describing internal hierarchies and are arranged in concept (sub-)collections. By supporting various kinds of data record relationships, the schema allows for the concatenation of individual records of the operational segments along a workflow (Fig. 1). Thus, it may serve as a logical and structural backbone for laboratory information management systems.
The concept structure in version 1.0 comprises 653 descriptors (concepts) and 1,810 predefined descriptor states, organised in 37 concept (sub-)collections. The published version 1.0 is available as various schema representations of identical content (https://www.mod-co.net/wiki/Schema_Representations). A normative XSD (XML Schema Definition) for schema version 1.0 is available at http://schema.mod-o.net/MOD-CO_1.0.xsd. The MOD-CO concepts may be integrated as descriptor/element structures in the relational database DiversityDescriptions (DWB-DD), an open-source and freely available software of the Diversity Workbench (DWB; https://diversityworkbench.net/Portal/DiversityDescriptions; https://diversityworkbench.net). Currently, DWB-DD is installed at the Data Center of the Bavarian Natural History Collections (SNSB) to build a dedicated instance for storing and maintaining MOD-CO-structured meta-omics research data packages and to enrich them with ‘metadata’ elements from the community standards Ecological Metadata Language (EML), Minimum Information about any (x) Sequence (MIxS), Darwin Core (DwC), and Access to Biological Collection Data (ABCD). These activities take place in the context of ongoing FAIR ('Findable, Accessible, Interoperable and Reusable') biodiversity research data publishing via the German Federation for Biological Data (GFBio) network (https://www.gfbio.org/). Version 1.1 of the schema, with extended collections of structural and operational design concepts, is scheduled for 2020.
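Serialising an operational record as XML, so that it can be mapped against a schema representation such as the MOD-CO XSD, can be sketched with Python's standard library. The element and attribute names below are illustrative assumptions, not the normative MOD-CO vocabulary:

```python
import xml.etree.ElementTree as ET

def record_to_xml(record_id: str, concepts: dict) -> str:
    # Build a minimal XML record: one root element per operational record,
    # one child element per descriptor (concept) value.
    root = ET.Element("Record", attrib={"id": record_id})
    for name, value in concepts.items():
        el = ET.SubElement(root, "Concept", attrib={"name": name})
        el.text = str(value)
    return ET.tostring(root, encoding="unicode")

# Usage: a hypothetical sampling record with two descriptor values.
xml_doc = record_to_xml(
    "sample-0001",
    {"gathering_site": "soil plot A", "target_marker": "ITS"},
)
print(xml_doc)
```

In a real pipeline the resulting document would additionally be validated against the published XSD; the sketch only shows the structural mapping from descriptor values to XML elements.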