From Scripted HPC-Based NGS Pipelines to Workflows on the Cloud

Cała, Jacek; Xu, Yaobo; Wijaya, Eldarina; Missier, Paolo

doi:10.1109/ccgrid.2014.128

Cited by 9 publications

(8 citation statements)

References 27 publications

“…This paper extends our preliminary workshop publication [13] which reported on initial progress on the Cloud e-Genome project, a collaboration between the School of Computing Science and Institute of Genetic Medicine at Newcastle University. This extended version offers the following new contributions:…”

Section: Contributions and Relevance To This Journal Special Issue (mentioning)

confidence: 59%

Scalable and efficient whole-exome data processing using workflows on the cloud

Cała, Marei, et al. 2016, Future Generation Computer Systems

Dataflow-style workflows offer a simple, high-level programming model for flexible prototyping of scientific applications as an attractive alternative to low-level scripting. At the same time, workflow management systems (WfMS) may support data parallelism over big datasets by providing scalable, distributed deployment and execution of the workflow over a cloud infrastructure. In theory, the combination of these properties makes workflows a natural choice for implementing Big Data processing pipelines, common for instance in bioinformatics. In practice, however, correct workflow design for parallel Big Data problems can be complex and very time-consuming. In this paper we present our experience in porting a genomics data processing pipeline from an existing scripted implementation deployed on a closed HPC cluster, to a workflow-based design deployed on the Microsoft Azure public cloud. We draw two contrasting and general conclusions from this project. On the positive side, we show that our solution based on the e-Science Central WfMS and deployed in the cloud clearly outperforms the original HPC-based implementation achieving up to 2.3x speed-up. However, in order to deliver such performance we describe the importance of optimising the workflow deployment model to best suit the characteristics of the cloud computing infrastructure. The main reason for the performance gains was the availability of fast, node-local SSD disks delivered by D-series Azure VMs combined with the implicit use of local disk resources by e-Science Central workflow engines. These conclusions suggest that, on parallel Big Data problems, it is important to couple understanding of the cloud computing architecture and its software stack with simplicity of design, and that further efforts in automating parallelisation of complex pipelines are required.

“…For instance, one may allocate virtual clusters in the cloud, e.g. using StarCluster 13 or CloudMan [36], and then simply transfer data and scripts verbatim. that is atypical of the usually lower performance of the cloud than HPC (cf.…”

Section: Related Work (mentioning)

confidence: 99%

Scalable and efficient whole-exome data processing using workflows on the cloud

Cała, Marei, et al. 2016, Future Generation Computer Systems

“…De Oliveira et al [11] propose a provenance based task scheduling algorithm for single site cloud environments. Some adaptation of SWfMSs [6,9] in the cloud environment can provide the parallelism in workflow level or activity level, which is coarse-grained, at a single site cloud. These methods cannot perform parallelism of the tasks of the same activities and they cannot handle the distributed input data at different sites.…”

Section: Related Work (mentioning)

confidence: 99%

Scientific Workflow Scheduling with Provenance Data in a Multisite Cloud

Liu, Pacitti, Valduriez, et al. 2017, Transactions on Large-Scale Data- And Knowledge-Centered Systems XXXIII

Abstract. Recently, some Scientific Workflow Management Systems (SWfMSs) with provenance support (e.g. Chiron) have been deployed in the cloud. However, they typically use a single cloud site. In this paper, we consider a multisite cloud, where the data and computing resources are distributed at different sites (possibly in different regions). Based on a multisite architecture of SWfMS, i.e. multisite Chiron, and its provenance model, we propose a multisite task scheduling algorithm that considers the time to generate provenance data. We performed an extensive experimental evaluation of our algorithm using Microsoft Azure multisite cloud and two real-life scientific workflows (Buzz and Montage). The results show that our scheduling algorithm is up to 49.6% better than baseline algorithms in terms of total execution time.

“…To verify our algorithm for real workflow, we used one from the Cloud e-Genome project [22] (Figure 10). The project's overall goal is to facilitate the adoption of genetic testing in clinical practice at a population scale.…”

Section: B Workflows From a Real Scientific Application (mentioning)

confidence: 99%

Performance evaluation for SDN deployment: an approach based on stochastic network calculus

Lin, Huang, et al. 2016, China Commun.

The significant increase in the use of cloud computing has led to an interest in partitioning applications over a set of public and private clouds in order to meet a range of non-functional requirements including performance (for example where private cloud resources alone are insufficient), dependability (e.g. to allow the application to continue to operate even if one cloud fails) and security (for example to ensure that sensitive data is restricted to sufficiently secure clouds and networks). This paper describes a novel deployment planning algorithm to partition complex workflow-based applications over federated clouds, while meeting security requirements. The security issues are based on our previous work which extends the Bell-LaPadula model to encompass cloud computing. Selecting the cheapest option for partitioning a workflow over a set of resources has been shown to be an NP-hard problem, which can take impractically long for partitioning large workflows over multiple clouds. We therefore introduce a novel adaptive partitioning algorithm to handle these large workflow applications, which significantly reduces the time required to choose a sufficiently good partitioning option. This is based on generating an initial partitioning, and then adapting it to see if a better solution can be found by bringing together on the same node services with significant communication costs. The algorithm has been implemented and evaluated by using both randomly generated and real world scientific workflows. The experiment results show that our algorithm is thousands of times quicker than the exhaustive algorithm presented in our previous work. Yet, on average it generates only 25% more costly solutions. We also compared this algorithm with two other methods commonly used to partition workflows over a set of clouds.
