Shared Data Science Infrastructure for Genomics Data

Bagheri, Hamid; Muppirala, Usha; Masonbrink, Rick E.; Severin, Andrew J.; Rajan, Hridesh

doi:10.21203/rs.2.4295/v3

Cited by 3 publications

(10 citation statements)

References 13 publications

(13 reference statements)

“…MG1655]’. BoaG is a domain-specific language that uses a Hadoop-based infrastructure for biological data (Bagheri et al., 2019). A BoaG program is submitted to the BoaG infrastructure.…”

Section: Methods (mentioning)

confidence: 99%

Detecting and correcting misclassified sequences in the large-scale public databases

2020

Self Cite

Motivation: As the cost of sequencing decreases, the amount of data being deposited into public repositories is increasing rapidly. Public databases rely on the user to provide metadata for each submission that is prone to user error. Unfortunately, most public databases, such as non-redundant (NR), rely on user input and do not have methods for identifying errors in the provided metadata, leading to the potential for error propagation. Previous research on a small subset of the NR database analyzed misclassification based on sequence similarity. To the best of our knowledge, the amount of misclassification in the entire database has not been quantified. We propose a heuristic method to detect potentially misclassified taxonomic assignments in the NR database. We applied a curation technique and quality control to find the most probable taxonomic assignment. Our method incorporates provenance and frequency of each annotation from manually and computationally created databases and clustering information at 95% similarity. Results: We found more than 2 million potentially taxonomically misclassified proteins in the NR database. Using simulated data, we show a high precision of 97% and a recall of 87% for detecting taxonomically misclassified proteins. The proposed approach and findings could also be applied to other databases. Availability: Source code, dataset, documentation, Jupyter notebooks, and Docker container are available at https://github.com/boalang/nr. Supplementary information: Supplementary data are available at Bioinformatics online.

Section: Methods (mentioning)

confidence: 99%

“…We utilize a genomics-specific language, BoaG, that uses the Hadoop cluster (Bagheri et al., 2019), to explore annotations in the NR database that is not available in other works.…”

Section: Introduction (mentioning)

confidence: 99%

Detecting and correcting misclassified sequences in the large-scale public databases

2020

Self Cite

“…When a BoaG program is executing in parallel, it emits values to the output aggregator that collects all data and provides the final output. Aggregators, for example, top, mean, maximum, and minimum, also can contain indices that would be a grouping operation similar to traditional query languages [9].…”

Section: BoaG Domain-specific Language (mentioning)

confidence: 99%
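
The aggregator behaviour described in the statement above can be illustrated with a small sketch. This is plain Python, not BoaG syntax, and it does not reproduce the BoaG or Hadoop implementation: the class names, the `run_query` helper, and the sample records are all hypothetical and exist only to show how emitted values are collected, and how an index on an aggregator acts like a grouping key.

```python
from collections import Counter, defaultdict


class MeanAggregator:
    """Hypothetical mean aggregator; the index plays the role of a group-by key."""

    def __init__(self):
        self._sum = defaultdict(float)
        self._count = defaultdict(int)

    def emit(self, index, value):
        # Each emitted value is accumulated under its index.
        self._sum[index] += value
        self._count[index] += 1

    def result(self):
        return {key: self._sum[key] / self._count[key] for key in self._sum}


class TopAggregator:
    """Hypothetical top-k aggregator; keeps the k most frequently emitted values."""

    def __init__(self, k):
        self._k = k
        self._weights = Counter()

    def emit(self, value, weight=1):
        self._weights[value] += weight

    def result(self):
        return self._weights.most_common(self._k)


def run_query(records):
    """Visit every (taxon, protein_length) record and emit to both aggregators.

    In a distributed setting each record could be visited by a different
    worker; here the loop is sequential for simplicity.
    """
    mean_length = MeanAggregator()  # indexed by taxon
    top_taxa = TopAggregator(k=1)   # no index: one global ranking
    for taxon, length in records:
        mean_length.emit(taxon, length)
        top_taxa.emit(taxon)
    return mean_length.result(), top_taxa.result()


# Made-up records for illustration only; not taken from the NR database.
sample_records = [
    ("Escherichia coli", 300),
    ("Escherichia coli", 500),
    ("Homo sapiens", 1000),
    ("Escherichia coli", 400),
    ("Homo sapiens", 800),
]

mean_by_taxon, most_common_taxon = run_query(sample_records)
print(mean_by_taxon)
print(most_common_taxon)
```

Because the mean aggregator is indexed by taxon, its result is one value per taxon, which is the grouping effect the statement compares to traditional query languages. The top aggregator has no index, so it yields a single ranking over all emitted values.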

“…To this end, we utilized BoaG to address these challenges at scale. BoaG belongs to the family of a domain-specific language and shared infrastructure, called Boa, that has been applied to address challenges in mining software repositories [9], genomics data [10], and big data transportation [11]. Boa can process and query terabytes of raw data and uses a backend based on map-reduce to effectively distribute computational analyses and querying tasks.…”

Section: Introduction (mentioning)

confidence: 99%

Comprehensive Analysis of Non Redundant Protein Database

Bagheri, Dyer, Severin, et al. 2020

Preprint

Self Cite

Background: Scientists around the world use NCBI’s non-redundant (NR) database to identify the taxonomic origin and functional annotation of their favorite protein sequences using BLAST. Unfortunately, due to the exponential growth of this database, many scientists do not have a good understanding of the contents of the NR database. There is a need for tools to explore the contents of large biological datasets, such as NR, to better understand the assumptions and limitations of the data they contain. Results: Protein sequence data, protein functional annotation, and taxonomic assignment from NCBI’s NR database were placed into a BoaG database, a domain-specific language and shared data science infrastructure for genomics, along with a CD-HIT clustering of all these protein sequences at different sequence similarity levels. We show that BoaG can efficiently perform queries on this large dataset to determine the average length of protein sequences and identify the most common taxonomic assignments and functional annotations. Using the clustering information, we also show that the non-redundant (NR) database has a considerable amount of annotation redundancy at the 95% similarity level. Conclusions: We implemented BoaG and provided a web-based interface to BoaG’s infrastructure that will help researchers to explore the dataset further. Researchers can submit queries and download the results or share them with others. Availability and implementation: The web-interface of the BoaG infrastructure can be accessed here: http://boa.cs.iastate.edu/boag. Please use user = boag and password = boag to login. Source code and other documentation are also provided as a GitHub repository: https://github.com/boalang/NR_Dataset.

“…To this end, we utilized BoaG to address these challenges at scale. BoaG belongs to the family of a domain-specific language and shared infrastructure, called Boa, that has been applied to address challenges in mining software repositories [28], genomics data [12], and big data transportation [36]. Boa can process and query terabytes of raw data and uses a backend based on map-reduce to effectively distribute computational analyses and querying tasks.…”

Section: Discussion (mentioning)

confidence: 99%

Towards data cleaning in large public biological databases

Bagheri

Self Cite
