2019
DOI: 10.1038/sdata.2019.21

The variable quality of metadata about biological samples used in biomedical experiments

Abstract: We present an analytical study of the quality of metadata about samples used in biomedical experiments. The metadata under analysis are stored in two well-known databases: BioSample—a repository managed by the National Center for Biotechnology Information (NCBI), and BioSamples—a repository managed by the European Bioinformatics Institute (EBI). We tested whether 11.4 M sample metadata records in the two repositories are populated with values that fulfill the stated requirements for such values. Our study reve…

Cited by 75 publications (77 citation statements)
References 23 publications
“…This problem is also common to other fields of research. A recent metaanalysis of 11.4 million metadata of samples used in biomedical experiments revealed multiple anomalies and highlighted the need for a more robust approach to reporting metadata [36]. Inclusion of metadata is crucial for finding suitable data sets for meta-analyses, will help to ensure interoperability of the metadata, and ultimately will allow us to begin to understand how the host plant (the environment of the rhizosphere) shapes macroecological patterns in rhizosphere microbiomes.…”
Section: Can We Transcend Taxonomic and Geographic Limitations? (mentioning)
confidence: 99%
“…User-defined fields, though useful and even necessary in certain situations, have led to a significant increase in heterogeneity across this dataset and others (5). The use of word embeddings for clustering attributes by semantic similarity revealed a lack of normalization in attribute naming, mostly in the form of small deviations in spelling and punctuation (e.g., cell type and Cell type).…”
Section: Discussion (mentioning)
confidence: 99%
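
The naming deviations mentioned in the statement above (e.g., cell type and Cell type) can be illustrated with a small sketch. This is not the word-embedding clustering the citing authors describe; it is a much simpler string normalization, shown only to make the problem concrete. The function names and the sample attribute list are hypothetical and do not come from either paper.

```python
import re
from collections import defaultdict


def normalize_attribute_name(name: str) -> str:
    """Lowercase the name and turn every run of non-alphanumeric
    characters (spaces, underscores, hyphens, ...) into one space."""
    lowered = name.lower()
    spaced = re.sub(r"[^a-z0-9]+", " ", lowered)
    return spaced.strip()


def group_attribute_names(names):
    """Group raw attribute names that share the same normalized form."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_attribute_name(name)].append(name)
    return dict(groups)


# Hypothetical attribute names, invented for illustration only.
variants = ["cell type", "Cell type", "cell_type", "Cell-Type ", "tissue"]
groups = group_attribute_names(variants)

for canonical, raw_names in groups.items():
    print(f"{canonical!r}: {raw_names}")
```

A rule like this only catches differences in case, spacing, and punctuation. Synonyms or misspellings would need a semantic approach such as the one the citing paper reports.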
“…(5) recently described the variable state of the metadata available in databases such as NCBI's BioSample and the European Bioinformatics Institute's (EBI) BioSamples (6). The infrequent use of controlled vocabularies in the metadata submission process, coupled by the allowance for the creation of user-defined attributes, has led to an explosion of heterogeneity in the overall metadata landscape (5). This can often hinder researchers' ability to fully utilize the potential information that a given dataset, or a meta-analysis of multiple datasets, might hold.…”
Section: Introduction (mentioning)
confidence: 99%
“…We constructed an evaluation pipeline to drive the analysis workflow (Figure 8). This pipeline consists of 7 sequential steps: (1) content download from NCBI BioSample and EBI BioSamples databases; (2) template design for each of those databases and generation of the corresponding template instances; (3) linkage of template instance field names and values to ontology terms; (4) dataset splitting into training and test sets; (5) generation of rules to drive the recommendations from the training set; (6) accuracy measurement using the test set; and (7) results analysis. These steps are now described in more detail.…”
Section: Discussion (mentioning)
confidence: 99%