1. Species occurrence records from online databases are an indispensable resource in ecological, biogeographical and palaeontological research. However, issues with data quality, especially incorrect geo-referencing or dating, can diminish their usefulness. Manual cleaning is time-consuming, error-prone, difficult to reproduce and limited to known geographical areas and taxonomic groups, making it impractical for datasets with thousands or millions of records.
2. Here, we present CoordinateCleaner, an R package to scan datasets of species occurrence records for geo-referencing and dating imprecisions and data entry errors in a standardized and reproducible way. CoordinateCleaner is tailored to problems common in biological and palaeontological databases and can handle datasets with millions of records. The software includes (a) functions to flag potentially problematic coordinate records based on geographical gazetteers, (b) a global database of 9,691 geo-referenced biodiversity institutions to identify records that are likely from horticulture or captivity, (c) novel algorithms to identify datasets with rasterized data, conversion errors and strong decimal rounding and (d) spatio-temporal tests for fossils.
3. We describe the individual functions available in CoordinateCleaner and demonstrate them on more than 90 million occurrences of flowering plants from the Global Biodiversity Information Facility (GBIF) and 19,000 fossil occurrences from the Palaeobiology Database (PBDB). We find that in GBIF more than 3.4 million records (3.7%) are potentially problematic and that 179 of the tested contributing [...]
This is an open access article under the terms of the Creative Commons Attribution-NonCommercial License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes.
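As an illustration of the kind of automated coordinate screening described in the CoordinateCleaner abstract above, the following minimal Python sketch flags a few common problems: missing values, out-of-range values, coordinates at or near 0/0, and identical latitude and longitude. It is not CoordinateCleaner's API or implementation; the `Occurrence` class, the `flag_record` function, the flag names and the 0.5-degree buffer are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Occurrence:
    """Hypothetical occurrence record with decimal-degree coordinates."""
    species: str
    lat: Optional[float]
    lon: Optional[float]


def flag_record(rec: Occurrence, zero_buffer: float = 0.5) -> List[str]:
    """Return a list of flag names for one record (empty list = no flags).

    The checks and the default buffer are illustrative assumptions,
    not the tests or defaults of any particular package.
    """
    if rec.lat is None or rec.lon is None:
        return ["missing"]
    if not (-90 <= rec.lat <= 90 and -180 <= rec.lon <= 180):
        return ["out_of_range"]

    flags: List[str] = []
    # Coordinates at or near (0, 0) often result from data entry errors.
    if abs(rec.lat) <= zero_buffer and abs(rec.lon) <= zero_buffer:
        flags.append("near_zero")
    # Identical absolute latitude and longitude can indicate a copy error.
    if abs(rec.lat) == abs(rec.lon):
        flags.append("equal_lat_lon")
    return flags


if __name__ == "__main__":
    records = [
        Occurrence("Quercus robur", 57.7, 11.97),
        Occurrence("Quercus robur", 0.0, 0.0),
        Occurrence("Quercus robur", 95.0, 10.0),
    ]
    for rec in records:
        print(rec.species, rec.lat, rec.lon, flag_record(rec))
```

Flagging rather than deleting records keeps the decision about what to discard with the analyst, which matches the abstract's emphasis on records that are "potentially problematic".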
Environmental sequencing regularly recovers fungi that cannot be classified to any meaningful taxonomic level beyond "Fungi". There are several examples where evidence of such lineages has been sitting in public sequence databases for up to ten years before receiving scientific attention and formal recognition. In order to highlight these unidentified lineages for taxonomic scrutiny, a search function is presented that produces updated lists of approximately genus-level clusters of fungal ITS sequences that remain unidentified at the phylum, class, and order levels, respectively. The search function (https://unite.ut.ee/top50.php) is implemented in the UNITE database for molecular identification of fungi, such that the underlying sequences and fungal lineages are open to third-party annotation. We invite researchers to examine these enigmatic fungal lineages in the hope that their taxonomic resolution will not have to wait another ten years or more.
Sequence comparison and analysis of the various ribosomal genetic markers are the dominant molecular methods for identification and description of fungi. However, new environmental fungal lineages known only from DNA data reveal significant gaps in our sampling of the fungal kingdom in terms of both taxonomy and marker coverage in the reference sequence databases. To facilitate the integration of reference data from all of the ribosomal markers, we present three sets of general primers that allow for amplification of the complete ribosomal operon from the ribosomal tandem repeats. The primers cover all ribosomal markers: ETS, SSU, ITS1, 5.8S, ITS2, LSU and IGS. We coupled these primers successfully with third-generation sequencing (PacBio and Nanopore sequencing) to showcase our approach on authentic fungal herbarium specimens (Basidiomycota), aquatic chytrids (Chytridiomycota) and a poorly understood lineage of early diverging fungi (Nephridiophagidae). In particular, we were able to generate high-quality reference data with Nanopore sequencing in a high-throughput manner, showing that the generation of reference data can be achieved on a regular desktop computer without the involvement of any large-scale sequencing facility. The quality of the Nanopore-generated sequences was 99.85%, which is comparable with the 99.78% accuracy described for Sanger sequencing. With this work, we hope to stimulate the generation of a new comprehensive standard of ribosomal reference data with the ultimate aim to close the huge gaps in our reference datasets.
[...] discussions on the implementation of long-read sequencing. The authors would like to acknowledge support from Science for Life Laboratory, the National Genomics Infrastructure (NGI) and Uppmax for providing assistance in massively parallel sequencing and computational infrastructure.
CW and RHN gratefully acknowledge financial support from Stiftelsen Olle Engkvist Byggmästare, Stiftelsen Lars Hiertas Minne, Kapten Carl Stenholms Donationsfond and Birgit och Birger Wåhlströms Minnesfond.
Sequence analysis of the various ribosomal genetic markers is the dominant molecular method for identification and description of fungi. However, there is little agreement on which ribosomal markers should be used, and research groups use different markers depending on which fungal groups are targeted. New environmental fungal lineages known only from DNA data reveal significant gaps in the coverage of the fungal kingdom, in terms of both taxonomy and marker coverage in the reference sequence databases. To integrate references covering all of the ribosomal markers, we present three sets of general primers that allow the amplification of the complete ribosomal operon from the ribosomal tandem repeats. The primers cover all ribosomal markers (ETS, SSU, ITS1, 5.8S, ITS2, LSU, and IGS) from the 5' end of the ribosomal operon all the way to the 3' end. We coupled these primers successfully with third-generation sequencing (PacBio and Nanopore sequencing) to showcase our approach on authentic fungal herbarium specimens. In particular, we were able to generate high-quality reference data with Nanopore sequencing in a high-throughput manner, showing that the generation of reference data can be achieved on a regular desktop computer without the need for a large-scale sequencing facility. The quality of the Nanopore-generated sequences was 99.85%, which is comparable with the 99.78% accuracy described for Sanger sequencing. With this work, we hope to stimulate the generation of a new comprehensive standard of ribosomal reference data with the ultimate aim to close the huge gaps in our reference datasets.
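The abstracts above report sequence quality as a percentage but do not say how it was computed. One common way to express such a figure is the percent identity between a generated sequence and a trusted reference after pairwise alignment. The Python sketch below shows that calculation for two already-aligned sequences; the function name `percent_identity`, the use of `-` as the gap character and the choice to count gaps as mismatches are assumptions for this example, not the authors' method.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity between two pre-aligned sequences of equal length.

    Assumptions for this sketch: '-' marks a gap, comparison is
    case-insensitive, columns where both sequences have a gap are
    skipped, and a gap opposite a base counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")

    matches = 0
    compared = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" and b == "-":
            continue  # column carries no information for this pair
        compared += 1
        if a == b:
            matches += 1

    if compared == 0:
        raise ValueError("no comparable alignment columns")
    return 100.0 * matches / compared


if __name__ == "__main__":
    # Toy alignment: one gap and one substitution in nine columns.
    consensus = "ACGT-ACGT"
    reference = "ACGTTACGA"
    print(f"{percent_identity(consensus, reference):.2f}% identity")
```

Other conventions exist, for example excluding gapped columns from the denominator, so reported percentages are only comparable when the same definition is used.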