This is an open access article under the terms of the Creative Commons Attribution-NonCommercial License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited and is not used for commercial purposes.
Genetic diversity within species represents a fundamental yet underappreciated level of biodiversity. Because genetic diversity can indicate species and population resilience to changing climate, its measurement is relevant to many national and global conservation policy targets. Many studies of evolutionary biology, molecular ecology and conservation genetics produce large amounts of genome-scale genetic diversity data for wild populations. While open data policies have ensured an abundance of freely available genomic data stored in the databases of the International Nucleotide Sequence Database Collaboration (INSDC), only about 13% of current accessions have the associated spatial and temporal metadata in INSDC necessary to be reused in monitoring programs, macrogenetic studies, or for acknowledging the sovereignty of nations or Indigenous Peoples. We undertook a “distributed datathon” to quantify the availability of these missing metadata in sources external to the INSDC and to test the hypothesis that these metadata decay with time. We also worked to remediate these missing metadata by extracting them, when present, from associated published papers, online repositories, and/or from direct communication with authors. Starting with 848 programmatically identified candidate datasets (INSDC BioProjects), we manually determined that 492 contained samples from wild populations. We successfully restored spatiotemporal metadata (locality name and/or geospatial coordinates and collection year) for 82% of these 492 datasets (N = 401 BioProjects comprising 42,104 individuals or BioSamples). We also quantified the availability of 33 additional categories of metadata in sources external to the INSDC. Information about associated publications and the type of habitat from which the samples were taken was the most easily found; information about sampling permits was the most challenging to locate. 
Searching papers and online repositories was much more fruitful than contacting authors, who replied to our email requests only 45% of the time. Overall, 23% of our email queries to authors yielded useful metadata. Importantly, we found that the probability of retrieving spatiotemporal metadata declines significantly with the age of the dataset, with a 13.5% yearly decrease for metadata located in published papers or online repositories and up to a 22% yearly decrease for metadata that were only available from authors. This observable metadata decay, mirrored in studies of other types of biological data, should motivate swift updates to data sharing policies and researcher practices to ensure that the valuable context provided by metadata is not lost forever.