SparkSeq: fast, scalable and cloud-ready tool for the interactive genomic data analysis with nucleotide precision

Wiewiórka, Marek; Messina, A.; Pacholewska, Alicja; Maffioletti, Sergio; Gawrysiak, Piotr; Okoniewski, Michał

doi:10.1093/bioinformatics/btu343

Cited by 98 publications

(44 citation statements)

References 8 publications

“…The paper is also valuable for new developments using Spark: it explains how each Spark feature can be used to speed up their algorithm. Next, SparkSeq [25] is a platform for analysing sequencing data using Spark. The authors state that its main benefits are keeping the datasets in memory instead of on disk and that it enables interactive analysis.…”

Section: Related Work (mentioning)

confidence: 99%

Scaling machine learning for target prediction in drug discovery using Apache Spark

Harnie

Saey

Vapirev

et al. 2017

Future Generation Computer Systems

We have used Spark to automatically distribute C++ predictors over a cluster. Our Spark application allows near-linear speedup and optimal cluster utilization. The core of the algorithm is easily changed to allow for experimentation.

Abstract: In the context of drug discovery, a key problem is the identification of candidate molecules that affect proteins associated with diseases. Inside Janssen Pharmaceutica, the Chemogenomics project aims to derive new candidates from existing experiments through a set of machine learning predictor programs, written in single-node C++. These programs take a long time to run and are inherently parallel, but do not use multiple nodes. We show how we reimplemented the pipeline using Apache Spark, which enabled us to lift the existing programs to a multi-node cluster without making changes to the predictors. We have benchmarked our Spark pipeline against the original, which shows almost linear speedup up to 8 nodes. In addition, our pipeline generates fewer intermediate files while allowing easier checkpointing and monitoring.

“…It distributes massive data collections across multiple nodes within a cluster of commodity servers. Spark, on the other hand, is a data-processing tool that operates on those distributed data collections; it does not do distributed storage. To study the utility of Apache Spark in the genomic context, SparkSeq was created [72]. It is a general-purpose, flexible, and easily extendable library for genomic cloud computing, and can be used to build genomic analysis pipelines in Scala and run them in an interactive way.…”

Section: Most Bioinformatics Tools Are Not Cloud-aware (mentioning)

confidence: 99%

Cloud Computing for Next-Generation Sequencing Data Analysis

Zhao

Watrous

Zhang

et al. 2017

Cloud Computing - Architecture and Applications

High-throughput next-generation sequencing (NGS) technologies have evolved rapidly and are reshaping the scope of genomics research. The substantial decrease in the cost of NGS techniques in the past decade has led to its rapid adoption in biological research and drug development. Genomics studies of large populations are producing a huge amount of data, giving rise to computational issues around the storage, transfer, and analysis of the data. Fortunately, cloud computing has recently emerged as a viable option to quickly and easily acquire the computational resources for large-scale NGS data analyses. Some cloud-based applications and resources have been developed specifically to address the computational challenges of working with very large volumes of data generated by NGS technology. In this chapter, we will review some cloud-based systems and solutions for NGS data analysis, discuss the practical hurdles and limitations in cloud computing, including data transfer and security, and share the lessons we learned from the implementation of Rainbow, a cloud-based tool for large-scale genome sequencing data analysis.

“…A generalization of the MapReduce model, Spark [16] has enabled the design of many complex cloud based applications and demonstrated very good performance numbers [17], [18], [19], [20]. To achieve such performance, the Spark runtime relies on an innovative data structure, called Resilient Distributed Dataset (RDD) [16], which is used to store distributed data collections with the support of parallel access and fault-tolerance.…”

Section: Introduction (mentioning)

confidence: 99%

The Cloud as an OpenMP Offloading Device

Yviquel

Araújo

2017

2017 46th International Conference on Parallel Processing (ICPP)

Abstract: Computation offloading is a programming model in which program fragments (e.g. hot loops) are annotated so that their execution is performed in dedicated hardware or accelerator devices. Although offloading has been extensively used to move computation to GPUs, through directive-based annotation standards like OpenMP, offloading computation to very large computer clusters can become a complex and cumbersome task. It typically requires mixing programming models (e.g. OpenMP and MPI) and languages (e.g. C/C++ and Scala), dealing with various access control mechanisms from different clouds (e.g. AWS and Azure), and integrating all this into a single application. This paper introduces the cloud as a computation offloading device. It integrates OpenMP directives, cloud based mapreduce Spark nodes and remote communication management such that the cloud appears to the programmer as yet another device available in its local computer. Experiments using LLVM, OpenMP 4.5 and Amazon EC2 show the viability of the proposed approach and enable a thorough analysis of the performance and costs involved in cloud offloading. The results show that although data transfers can impose overheads, cloud offloading can still achieve promising speedups of up to 86x in 256 cores for the 2MM benchmark using 1GB matrices.

SparkSeq: fast, scalable and cloud-ready tool for the interactive genomic data analysis with nucleotide precision

Abstract: Available under open source Apache 2.0 license: https://bitbucket.org/mwiewiorka/sparkseq/.
