A Review of Scalable Bioinformatics Pipelines

Fjukstad, Bjørn; Bongo, Lars Ailo

doi:10.1007/s41019-017-0047-z

Cited by 30 publications

(24 citation statements)

References 27 publications

“…In line with current international efforts of standardizing workflow descriptions (11), analysis workflows in Trecode are written using WDL (7) and are executed by the Cromwell workflow executer (11). When generating workflow code, our emphasis is on reuse, which has resulted in a compact non-redundant and well documented code base which is easy to maintain, extend and reuse.…”

Section: Discussion (mentioning)

confidence: 99%

Trecode: a FAIR eco-system for the analysis and archiving of omics data in a combined diagnostic and research setting

Hehir-Kwa et al. 2020

Preprint

Motivation: The increase in speed, reliability and cost-effectiveness of high-throughput sequencing has led to the widespread clinical application of genome (WGS), exome (WXS) and transcriptome analysis. WXS and RNA sequencing is now being implemented as standard of care for patients and for patients included in clinical studies. To keep track of sample relationships and analyses, a platform is needed that can unify metadata for diverse sequencing strategies with sample metadata whilst supporting automated and reproducible analyses. In essence ensuring that analysis is conducted consistently, and data is Findable, Accessible, Interoperable and Reusable (FAIR). Results: We present “Trecode”, a framework that records both clinical and research sample (meta) data and manages computational genome analysis workflows executed for both settings. Thereby achieving tight integration between analyses results and sample metadata. With complete, consistent and FAIR (meta) data management in a single platform, stacked bioinformatic analyses are performed automatically and tracked by the database ensuring data provenance, reproducibility and reusability which is key in worldwide collaborative translational research. Availability and implementation: The Trecode data model, codebooks, NGS workflows and client programs are currently being cleared from local compute infrastructure dependencies and will become publicly available in spring 2021. Contact: p.kemmeren@prinsesmaximacentrum.nl

“…One of the notable attributes of the popular BLAST search is that it scales with the number of CPU cores [51]. As a result, to present NORTH as an alternative to BLAST-based approaches, we propose a scalable implementation of NORTH, which will aid clustering of plethora of genes.…”

Section: Scalability (mentioning)

confidence: 99%

NORTH: a highly accurate and scalable Naive Bayes based ORTHologous gene clustering algorithm

Ibtehaz, Ahmed, Saha, et al. 2019

Preprint

Background: Identifying orthologous genes plays a pivotal role in comparative genomics as the orthologous genes remain less diverged in the course of evolution. However, identifying orthologous genes is often difficult, slow, and idiosyncratic, especially in the presence of multiplicity of domains in proteins, evolutionary dynamics, multiple paralogous genes, incomplete genome data, and for distantly related species. Results: We present NORTH, a novel, automated, highly accurate and scalable machine learning based orthologous gene cluster prediction method. We have utilized the biological basis of orthologous genes and made an effort to incorporate appropriate ideas from machine learning (ML) and natural language processing (NLP). NORTH outperforms the frequently used existing orthologous clustering algorithms on the OrthoBench benchmark, not only just quantitatively with a high margin, but qualitatively under the challenging scenarios as well. Furthermore, we studied 12,55,877 genes in the largest 250 orthologous clusters from the KEGG database, across 3,880 organisms comprising the six major groups of life. NORTH is able to cluster them with 98.48% precision, 98.43% recall and 98.44% F1 score. Conclusions: This is the first study that maps the orthology identification to the text classification problem, and achieves remarkable accuracy and scalability. NORTH thus advances the state-of-the-art in orthologous gene prediction, and has the potential to be considered as an alternative to the existing phylogenetic tree and BLAST based methods.

“…The C3PO MUSC Transdisciplinary Collaborative Center system ingests clinical data from REDCap [36] for the project and integrates it into the OMOP model in its Spark/Hadoop framework. Since, C3PO was developed so it can generalize to other data types such as genomic and imaging, Spark/Hadoop frameworks [37][38][39][40] for genomic and imaging can be integrated in future versions of the system.…”

Section: Generalizability (mentioning)

confidence: 99%

Data Integration Strategies for Predictive Analytics in Precision Medicine

Frey 2018

Per. Med.

With the rapid growth of health-related data including genomic, proteomic, imaging and clinical, the arduous task of data integration can be overwhelmed by the complexity of the environment including data size and diversity. This report examines the role of data integration strategies for big data predictive analytics in precision medicine research. Infrastructure-as-code methodologies will be discussed as a means of integrating and managing data. This includes a discussion on how and when these strategies can be used to lower barriers and address issues of consistency and interoperability within medical research environments. The goal is to support translational research and enable healthcare organizations to integrate and utilize infrastructure to accelerate the adoption of precision medicine.
