The promise of easy access to a virtually unlimited number of resources makes Infrastructure as a Service (IaaS) clouds a good candidate for the execution of data-intensive workflow applications composed of hundreds of computational tasks. Thanks to careful execution planning, workflow management systems can build a tailored compute infrastructure by combining a set of virtual machine instances. However, these applications usually rely on files to handle dependencies between tasks. A storage space shared by all virtual machines can become a bottleneck and severely impact the application's execution time. In this article, we propose an original data-aware planning algorithm that leverages two characteristics of a family of virtual machine instances, that is, a large number of cores and a dedicated storage space on fast SSD drives, to improve data locality, hence reducing the amount of data transferred over the network during the execution of a workflow. We also propose a simulation-driven approach to solve a cost-performance optimization problem and correctly dimension the virtual infrastructure on which to execute a given workflow. Experiments conducted with real application workflows show the benefits of the presented algorithms. The data-aware planning leads to a clear reduction of both execution time and volume of data transferred over the network, while the simulation-driven approach allows us to dimension the infrastructure in a reasonable time.
KEYWORDS
data-intensive workflows, IaaS cloud, makespan reduction, workflow scheduling
INTRODUCTION
Scientific workflows constitute an appealing approach to express the complex orchestration of interdependent computations and have become mainstream in many scientific domains.1 They usually allow users to describe the different steps needed to go from the typically vast amount of data generated by a scientific experiment to the production of an original scientific result. The execution of such data-intensive applications, made of hundreds of computational tasks, on large-scale distributed infrastructures is usually handled by a workflow management system (WMS).2-4 Tasks such as resource selection, data management, or computation scheduling are delegated to the WMS, hence hiding the complexity of these operations from the end user.

Commodity clusters and computing grids have long been the infrastructures of choice to execute scientific workflows. The workflow owner's institution generally hosts and manages the former, hence easing access to resources, while the latter allow scientists to run their workflows at an unprecedented scale by aggregating resources from multiple institutions. With the support of major companies such as Amazon,5 Google,6 or Microsoft,7 Infrastructure as a Service (IaaS) clouds have become serious contenders to clusters and grids. Indeed, IaaS clouds combine their respective advantages by providing easy access to a virtually unlimited amount of resources. Thanks to a careful planning of a workflow