Mario: Interactive Tuning of Biological Analysis Pipelines Using Iterative Processing

Ernstsen, Martin; Kjærner-Semb, Erik; Willassen, Nils Peder; Bongo, Lars Ailo

doi:10.1007/978-3-319-14325-5_23

Cited by 2 publications (4 citation statements)

References 17 publications

“…Mario [40] is a system for interactive iterative data processing (figure 3). We have designed Mario for interactive parameter tuning of biological data analysis pipeline tools.…”

Section: Mario (mentioning)

confidence: 99%

Integrating Data-Intensive Computing Systems with Biological Data Analysis Frameworks

Pedersen, Raknes, Ernstsen, et al. 2015

2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing

Self Cite

Biological data analysis is typically implemented using a pipeline that combines many data analysis tools and meta-databases. These pipelines must scale to very large datasets, and therefore often require parallel and distributed computing. There are many infrastructure systems for data-intensive computing. However, most biological data analysis pipelines do not leverage these systems. An important challenge is therefore to integrate biological data analysis frameworks with data-intensive computing infrastructure systems. In this paper, we describe how we have extended data-intensive computing systems to support unmodified biological data analysis tools. We also describe four approaches for integrating the extended systems with biological data analysis frameworks, and discuss challenges for such integration on production platforms. Our results demonstrate how biological data analysis pipelines can benefit from infrastructure systems for data-intensive computing.

“…Mario combines reservoir sampling, fine-grained caching of derived datasets, and a data-parallel processing model for quickly computing the results of changes to pipeline parameters. It adds less than 100ms of overhead per pipeline stage, and it does not add significant computation, memory, or storage overhead to compute nodes [40].…”

Section: Mario (mentioning)

confidence: 99%
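The citation statement above names reservoir sampling as one of the techniques Mario combines. As a rough illustration of that general technique only, and not of Mario's actual implementation, the following is a minimal sketch of the classic single-pass reservoir sampler (often called Algorithm R). The function name `reservoir_sample` and its parameters are assumptions made for this example; the source does not describe Mario's API.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Pick up to k items uniformly at random from a stream of unknown length.

    Hypothetical helper for illustration; not taken from the Mario paper.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 5 records from a large input without holding it all in memory.
    print(reservoir_sample(range(1_000_000), 5, random.Random(42)))
```

A sampler of this kind makes one pass over the input and keeps only k items in memory, which is one plausible reason it suits quick previews of the effect of a parameter change on a large dataset.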

Integrating Data-Intensive Computing Systems with Biological Data Analysis Frameworks

Pedersen, Raknes, Ernstsen, et al. 2015

2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing

Self Cite

“…Mario [37] is a system for interactive iterative data processing. We have designed Mario for interactive parameter tuning of biological data analysis pipeline tools.…”

Section: Mario (mentioning)

confidence: 99%

“…We achieved a system for iterative parallel processing that adds less than 100ms of overhead per pipeline stage, and that does not add significant computation, memory, or storage overhead to compute nodes (additional experimental results are in [37]). We found HBase to be very well suited for efficiently storing and accessing the sparse data-structures used by Mario.…”

Section: Mario (mentioning)

confidence: 99%

Data-Intensive Computing Infrastructure Systems for Unmodified Biological Data Analysis Pipelines

Bongo, Pedersen, Ernstsen 2015

Computational Intelligence Methods for Bioinformatics and Biostatistics

Self Cite

Abstract. Biological data analysis is typically implemented using a deep pipeline that combines a wide array of tools and databases. These pipelines must scale to very large datasets, and consequently require parallel and distributed computing. It is therefore important to choose a hardware platform and underlying data management and processing systems well suited for processing large datasets. There are many infrastructure systems for such data-intensive computing. However, in our experience, most biological data analysis pipelines do not leverage these systems. We give an overview of data-intensive computing infrastructure systems, and describe how we have leveraged these for: (i) scalable fault-tolerant computing for large-scale biological data; (ii) incremental updates to reduce the resource usage required to update large-scale compendia; and (iii) interactive data analysis and exploration. We provide lessons learned and describe problems we have encountered during development and deployment. We also provide a literature survey on the use of data-intensive computing systems for biological data processing. Our results show how unmodified biological data analysis tools can benefit from infrastructure systems for data-intensive computing.
