2021
DOI: 10.1093/database/baab021

Increasing metadata coverage of SRA BioSample entries using deep learning–based named entity recognition

Abstract: High-quality metadata annotations for data hosted in large public repositories are essential for research reproducibility and for conducting fast, powerful and scalable meta-analyses. Currently, a majority of sequencing samples in the National Center for Biotechnology Information’s Sequence Read Archive (SRA) are missing metadata across several categories. In an effort to improve the metadata coverage of these samples, we leveraged almost 44 million attribute–value pairs from SRA BioSample to train a scalable, …
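The approach summarized in the abstract amounts to a token-classification workflow over free-text sample descriptions. As a minimal sketch only, and not the authors' published pipeline, the Python snippet below shows how a fine-tuned named entity recognition model could tag attribute values in a BioSample description; the model identifier, label names, and example text are hypothetical placeholders.

from transformers import pipeline

# Hypothetical token-classification model fine-tuned on BioSample
# attribute-value pairs; "my-org/biosample-ner" is a placeholder id.
ner = pipeline(
    "token-classification",
    model="my-org/biosample-ner",
    aggregation_strategy="simple",
)

# Free-text description of the kind found in SRA BioSample entries.
description = (
    "Whole blood from a 45-year-old female Homo sapiens, "
    "collected in Boston in 2019."
)

# Each predicted entity pairs a metadata attribute label (e.g. TISSUE,
# AGE, SEX) with the text span holding its value, which could be written
# back as an attribute-value pair to fill a missing BioSample field.
for entity in ner(description):
    print(f"{entity['entity_group']}: {entity['word']}")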

Cited by 15 publications (9 citation statements)
References 17 publications (18 reference statements)

“…Many automated or semi-automated methods aim to standardize metadata by clustering or mapping to ontologies (19); however, collecting precise metadata about locations remains a laborious and error-prone process (2, 4). In this study, SGMC employed a semi-automated technique that utilizes cloud and generative artificial intelligence approaches (i.e., ChatGPT) for the curation and update of geospatial metadata from the NCBI SRA.…”
Section: Discussion
confidence: 99%
“…However, like all large datasets, the SRA may contain errors and inaccuracies (3). These errors may arise from human mistakes (e.g., typographical errors, incorrect entries, missing information), technical issues (e.g., faulty data submission, transfer, or processing due to issues with the software, hardware, or network), and/or lack of standardization (e.g., expanded generation of user-defined properties and infrequent use of controlled vocabularies during data submission over time) (4, 5). As a result, SRA users may encounter absent or erroneous data across categories, including missing fields from data sources in the cloud, unclear synonyms, spelling variants, and heterogeneous sample data specification (2).…”
Section: Introduction
confidence: 99%
“…Our reanalysis again demonstrates, if needed, the importance of sharing clinical data with public databases. As previously observed, too many clinical articles are still published without the relative sample codes or are deposited with incorrect labels [99, 100] and private requests remain unanswered. This is not only a limit to the reproducibility of published results but, as demonstrated by this study and several other meta-analyses, new important biological knowledge can be produced with ML and AI (Artificial Intelligence) approaches.…”
Section: Discussion
confidence: 99%
“…It has been shown that as the amount and quality of the data increase, the output and performance of the system increase as well. 58, 59 High-resolution, large datasets are ideal for ML applications. The importance of the scale of the dataset can be seen in studies and AI-based models where an extremely vast amount of data has been used to train the model, such as ChatGPT where billions of data were used, and in the study done by Liang et al 31 where more than 100 million data points were used and achieved a remarkable outcome.…”
Section: Cracking the Code—Predictive Modeling and Machine Learning
confidence: 99%