Scalable analysis of multi-modal biomedical data

Smith, Jaclyn; Shi, Yao; Benedikt, Michael; Nikolić, Miloš

doi:10.1093/gigascience/giab058

Cited by 3 publications

(2 citation statements)

References 45 publications

“…We designed the relational model to represent alignments and pileup function results as proposed in Sun et al., 2018 and Smith et al., 2021. Our package provides both SQL (Structured Query Language) and Dataframe programming interfaces for the Scala and Python (https://github.com/biodatageeks/pysequila) languages.…”

Section: Methods (mentioning)

confidence: 96%

“…We used its three extension points: (i) SQL Analyzer - to register new table-valued functions, (ii) Planner - to add our optimized execution strategies for pileup calculations, and (iii) Logical Optimizer - to detect CreateDataSourceTableAsSelectCommand and InsertIntoHadoopFsRelationCommand actions and apply optimizations for direct vectorized writes into the Optimized Row Columnar (ORC) files (Figure 2). We designed the relational model to represent alignments and pileup function results as proposed in Sun et al., 2018 and Smith et al., 2021. Our package provides both SQL (Structured Query Language) and Dataframe programming interfaces for the Scala and Python (https://github.com/biodatageeks/pysequila) languages.…”

Section: Technical Design (mentioning)

confidence: 99%

Cloud-native distributed genomic pileup operations

Wiewiórka, Szmurło, Stankiewicz, et al. 2022 (Preprint)

Motivation: Pileup analysis is a building block of many bioinformatics pipelines, including variant calling and genotyping. This step tends to become a bottleneck of the entire assay since the straightforward pileup implementations involve processing all base calls from all alignments sequentially. On the other hand, a distributed version of the algorithm faces the intrinsic challenge of splitting read-oriented file formats into self-contained partitions to avoid costly data exchange between computation nodes. Results: Here, we present a scalable, distributed, and efficient implementation of a pileup algorithm that is suitable for deployment in cloud computing environments. In particular, we implemented: (i) our custom data-partitioning algorithm optimized to work with the alignment reads, (ii) a novel and unique approach to process alignment events from sequencing reads using the MD tags, (iii) the source code micro-optimizations for recurrent operations, and (iv) a modular structure of the algorithm. We have proven that our novel approach consistently and significantly outperforms other state-of-the-art distributed tools in terms of execution time (up to 6.5x faster) and memory usage (up to 2x less), resulting in a substantial cloud cost reduction. SeQuiLa is a cloud-native solution that can be easily deployed using any managed Kubernetes and Hadoop services available in public clouds, like Microsoft Azure Cloud, Google Cloud Platform, or Amazon Web Services. Together with the already implemented distributed range joins and coverage calculations, our package provides end-users with a unified SQL interface for convenient analysis of population-scale genomic data in an interactive way. See https://biodatageeks.github.io/sequila/ for details.
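For readers unfamiliar with the term, the "straightforward" sequential pileup that the abstract describes as a bottleneck can be sketched roughly as follows. This is a minimal illustration, not SeQuiLa's implementation: it assumes ungapped alignments given as hypothetical (start, sequence) pairs and ignores CIGAR strings, MD tags, and base qualities. The function name `naive_pileup` is made up for this sketch.

```python
from collections import Counter, defaultdict


def naive_pileup(reads):
    """Count bases per reference position by visiting every base of every read.

    `reads` is an iterable of (start, sequence) pairs with 0-based starts.
    Assumption for this sketch: every read aligns without gaps or clipping.
    """
    columns = defaultdict(Counter)
    for start, seq in reads:
        # Sequential pass over all base calls: cost grows with total read length.
        for offset, base in enumerate(seq):
            columns[start + offset][base] += 1
    # Return positions in ascending order with plain dicts of base counts.
    return {pos: dict(counts) for pos, counts in sorted(columns.items())}


# Hypothetical toy input: three overlapping reads.
reads = [(0, "ACGT"), (2, "GTTA"), (3, "TTAC")]
pileup = naive_pileup(reads)
coverage = {pos: sum(counts.values()) for pos, counts in pileup.items()}
```

Because a read that overlaps a partition boundary contributes to columns on both sides, a distributed version has to decide how reads are split across partitions, which is the challenge the abstract refers to.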

Cloud-native distributed genomic pileup operations

et al. 2022

Motivation: Pileup analysis is a building block of many bioinformatics pipelines, including variant calling and genotyping. This step tends to become a bottleneck of the entire assay since the straightforward pileup implementations involve processing all base calls from all alignments sequentially. On the other hand, a distributed version of the algorithm faces the intrinsic challenge of splitting read-oriented file formats into self-contained partitions to avoid costly data exchange between computational nodes. Results: Here, we present a scalable, distributed, and efficient implementation of a pileup algorithm that is suitable for deployment in cloud computing environments. In particular, we implemented: (i) our custom data-partitioning algorithm optimized to work with the alignment reads, (ii) a novel and unique approach to process alignment events from sequencing reads using the MD tags, (iii) the source code micro-optimizations for recurrent operations, and (iv) a modular structure of the algorithm. We have proven that our novel approach consistently and significantly outperforms other state-of-the-art distributed tools in terms of execution time (up to 6.5x faster) and memory usage (up to 2x less), resulting in a substantial cloud cost reduction. SeQuiLa is a cloud-native solution that can be easily deployed using any managed Kubernetes and Hadoop services available in public clouds, like Microsoft Azure Cloud, Google Cloud Platform, or Amazon Web Services. Together with the already implemented distributed range joins and coverage calculations, our package provides end-users with a unified SQL interface for convenient analysis of population-scale genomic data in an interactive way. Availability: https://biodatageeks.github.io/sequila/ Supplementary information: Supplementary data are available at Bioinformatics online.

Scalable analysis of multi-modal biomedical data

et al. 2021

Self Cite

Background: Targeted diagnosis and treatment options depend on insights drawn from multi-modal analysis of large-scale biomedical datasets. Advances in genomic sequencing, image processing, and medical data management have supported data collection and management within medical institutions. These efforts have produced large-scale datasets and have enabled integrative analyses that provide a more thorough look at the impact of a disease on the underlying system. The integration of large-scale biomedical data commonly involves several complex data transformation steps, such as combining datasets to build feature vectors for learning analysis. Thus, scalable data integration solutions play a key role in the future of targeted medicine. Though large-scale data processing frameworks have shown promising performance for many domains, they fail to support scalable processing of complex datatypes. Solution: To address these issues and achieve scalable processing of multi-modal biomedical data, we present TraNCE, a framework that automates the difficulties of designing distributed analyses with complex biomedical data types. Performance: We outline research and clinical applications for the platform, including data integration support for building feature sets for classification. We show that the system is capable of outperforming the common alternative, based on “flattening” complex data structures, and runs efficiently when alternative approaches are unable to perform at all.
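The "flattening" alternative mentioned in the abstract can be illustrated with a small sketch. This is not TraNCE's API; the record layout (`sample`, `variants`, `pos`, `genes`) and the function name `flatten` are hypothetical, chosen only to show how unnesting a two-level structure repeats the outer fields on every output row.

```python
def flatten(samples):
    """Unnest sample -> variants -> genes into flat (sample, pos, gene) rows.

    Assumption for this sketch: each record is a dict with a "sample" id and
    a "variants" list, and each variant has a "pos" and a "genes" list.
    """
    rows = []
    for record in samples:
        for variant in record["variants"]:
            for gene in variant["genes"]:
                # Outer fields are copied onto every row of the inner collection.
                rows.append((record["sample"], variant["pos"], gene))
    return rows


# Hypothetical nested input: two samples, three variants, four gene annotations.
nested = [
    {"sample": "S1", "variants": [
        {"pos": 100, "genes": ["G1", "G2"]},
        {"pos": 200, "genes": ["G3"]},
    ]},
    {"sample": "S2", "variants": [
        {"pos": 100, "genes": ["G1"]},
    ]},
]
rows = flatten(nested)
```

Two nested records become four flat rows here, and the sample identifier is stored once per gene annotation rather than once per sample. With deeper nesting or larger inner collections this duplication grows multiplicatively, which is one plausible reason a flattening-based approach can struggle at scale.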
