Summary
Cloud computing is a type of distributed system in which services are typically delivered to users under a Service Level Agreement (SLA). In this setting, a fault-tolerant system that ensures reliability and service continuity is a major requirement. In this paper, we propose a fault tolerance strategy based on checkpointing and replication. Our approach uses a smart checkpoint infrastructure for cloud computing tasks. The checkpoints are stored in alternative VMs that have already been paid for. This allows a task to resume execution faster and at lower cost after a node crash. Because checkpoints are distributed and replicated, our approach also increases system reliability. The experimental results show the effectiveness of the proposed strategy in terms of energy consumption, SLA violations, and reliability.