Summary
As the adoption of Software as a Service (SaaS) cloud computing continues to gain momentum, the challenges of scheduling parallel applications on such platforms need to be addressed. Due to the complexity and fine‐grained parallelism of the workload, as well as the multi‐tenancy of the underlying host environment, end‐user applications are prone to transient software failures. Fault tolerance is therefore one of the most crucial aspects of scheduling in SaaS clouds, and it is usually achieved through application‐directed checkpointing. However, selecting an appropriate checkpointing interval is not a trivial task. Unnecessarily frequent checkpointing adds overhead that may degrade system performance, whereas infrequent checkpointing means that more work is lost and must be re‐executed after each failure, leading to longer recovery times and thus poorer performance. Consequently, the checkpointing interval must be selected taking into account both the failure probability and the nature of the workload. To this end, we investigate via simulation the impact of checkpointing interval selection on the performance of a SaaS cloud in which fine‐grained parallel applications with firm deadlines and approximate computations are scheduled for execution under various failure probabilities. We analyze the simulation results in an attempt to shed light on the relation between the checkpointing interval and the failure probability.
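The trade-off between checkpointing overhead and recovery cost can be illustrated with a small analytical sketch. This is not the simulation model used in this study; it is a deliberately simplified toy model built on assumptions of our own: a single sequential task of `work` unit-time steps, an independent failure probability `p` per step, a fixed cost `ckpt_cost` per checkpoint, and rollback to the most recent checkpoint after a failure. Deadlines, approximate computations, and parallelism are ignored, and all function and parameter names are hypothetical.

```python
def expected_segment_time(k: int, p: float) -> float:
    """Expected time to finish k consecutive failure-free unit steps.

    Assumption: every step takes one time unit and fails independently
    with probability p; a failure discards the progress of the current
    segment, which then restarts from its beginning.
    """
    if k == 0:
        return 0.0
    if p == 0.0:
        return float(k)
    q = 1.0 - p
    # Expected number of Bernoulli trials until k consecutive successes.
    return (q ** -k - 1.0) / p


def expected_completion_time(work: int, interval: int, p: float,
                             ckpt_cost: float) -> float:
    """Expected total time when a checkpoint is taken every `interval` steps.

    Assumption: a checkpoint is also taken after the final (possibly
    shorter) segment, and checkpoints themselves never fail.
    """
    full, rest = divmod(work, interval)
    total = full * (expected_segment_time(interval, p) + ckpt_cost)
    if rest:
        total += expected_segment_time(rest, p) + ckpt_cost
    return total


def best_interval(work: int, p: float, ckpt_cost: float) -> int:
    """Exhaustively search for the interval with the lowest expected time."""
    return min(range(1, work + 1),
               key=lambda k: expected_completion_time(work, k, p, ckpt_cost))


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the study.
    for p in (0.001, 0.01, 0.05):
        k = best_interval(100, p, 1.0)
        t = expected_completion_time(100, k, p, 1.0)
        print(f"p={p}: best interval={k}, expected time={t:.1f}")
```

Under these assumptions, the interval that minimizes the expected completion time shrinks as the failure probability grows, which is the qualitative behavior that motivates selecting the interval with the failure probability in mind.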