Transparent Incremental Updates for Genomics Data Analysis Pipelines

Pedersen, Edvard; Willassen, Nils Peder; Bongo, Lars Ailo

doi:10.1007/978-3-642-54420-0_31

Cited by 6 publications

(10 citation statements)

References 27 publications

“…We achieved up to 82% reduction in analysis time for compendium updates when using GeStore with an unmodified biological data analysis pipeline ([34] has additional experimental results). We found HBase to be well suited for the data management requirements of GeStore.…”

Section: GeStore (mentioning)

confidence: 99%

Data-Intensive Computing Infrastructure Systems for Unmodified Biological Data Analysis Pipelines

Bongo, Pedersen, Ernstsen (2015)

Computational Intelligence Methods for Bioinformatics and Biostatistics

Self Cite

Abstract. Biological data analysis is typically implemented using a deep pipeline that combines a wide array of tools and databases. These pipelines must scale to very large datasets, and consequently require parallel and distributed computing. It is therefore important to choose a hardware platform and underlying data management and processing systems well suited for processing large datasets. There are many infrastructure systems for such data-intensive computing. However, in our experience, most biological data analysis pipelines do not leverage these systems. We give an overview of data-intensive computing infrastructure systems, and describe how we have leveraged these for: (i) scalable fault-tolerant computing for large-scale biological data; (ii) incremental updates to reduce the resource usage required to update large-scale compendia; and (iii) interactive data analysis and exploration. We provide lessons learned and describe problems we have encountered during development and deployment. We also provide a literature survey on the use of data-intensive computing systems for biological data processing. Our results show how unmodified biological data analysis tools can benefit from infrastructure systems for data-intensive computing.

Section: GeStore (mentioning)

confidence: 99%

“…GeStore [34] is a framework for adding transparent incremental updates to data processing pipelines. We use GeStore to incrementally update large-scale compendia such as the IMP compendia described in the previous section.…”

Section: GeStore (mentioning)

confidence: 99%

Data-Intensive Computing Infrastructure Systems for Unmodified Biological Data Analysis Pipelines

Bongo, Pedersen, Ernstsen (2015)

Computational Intelligence Methods for Bioinformatics and Biostatistics

Self Cite

“…GeStore [38] is a system for adding transparent incremental updates to biological data processing pipelines. We use GeStore to periodically update large-scale compendia, such as the IMP compendia described in the previous section.…”

Section: B GeStore (mentioning)

confidence: 99%

“…We built GeStore since the processing time for a full compendium update can be several days even on a large computer cluster, making it impractical to frequently update large-scale compendia. We have achieved up to 82% reduction in analysis time for dataset updates when using GeStore with an unmodified biological data analysis pipeline [38]. GeStore also provides efficient meta-database management for large scale meta-databases.…”

Section: B GeStore (mentioning)

confidence: 99%

Integrating Data-Intensive Computing Systems with Biological Data Analysis Frameworks

Pedersen, Raknes, Ernstsen, et al. (2015)

2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing

Self Cite

Biological data analysis is typically implemented using a pipeline that combines many data analysis tools and meta-databases. These pipelines must scale to very large datasets, and therefore often require parallel and distributed computing. There are many infrastructure systems for data-intensive computing. However, most biological data analysis pipelines do not leverage these systems. An important challenge is therefore to integrate biological data analysis frameworks with data-intensive computing infrastructure systems. In this paper, we describe how we have extended data-intensive computing systems to support unmodified biological data analysis tools. We also describe four approaches for integrating the extended systems with biological data analysis frameworks, and discuss challenges for such integration on production platforms. Our results demonstrate how biological data analysis pipelines can benefit from infrastructure systems for data-intensive computing.

Mario: Interactive Tuning of Biological Analysis Pipelines Using Iterative Processing

Ernstsen, Kjærner-Semb, Willassen, et al. (2014)

Lecture Notes in Computer Science
