Highlights
• A review of failure-aware algorithms for resource provisioning on clouds.
• The development of power- and failure-aware cloud scheduling algorithms that implement vertical and horizontal platform elasticity.
• Autonomous readjustment of virtual-to-physical mappings on the fly.
• Extensive evaluation of the dynamic strategy and of the scheduling algorithms.
• Random workloads and workloads that follow the Google cloud tracelogs.
Abstract
Empowered by virtualisation technology, cloud infrastructures enable the construction of flexible and elastic computing environments, providing an opportunity to optimise energy and resource costs while enhancing system availability and achieving high performance. A crucial requirement for effective consolidation is the ability to utilise system resources efficiently for high-availability computing and energy-efficiency optimisation, thereby reducing operational costs and the carbon footprint. Additionally, failures in highly networked computing systems can substantially degrade system performance, preventing the system from achieving its initial objectives. In this paper, we propose algorithms to dynamically construct and readjust virtual clusters to enable the execution of users' jobs. Allied with an energy-optimising mechanism to detect and mitigate energy inefficiencies, our decision-making algorithms leverage virtualisation tools to provide proactive fault tolerance and energy efficiency to virtual clusters. We conducted simulations by injecting random synthetic jobs and jobs based on the latest version of the Google cloud tracelogs. The results indicate that our strategy improves the work per Joule ratio by approximately 12.9% and the working efficiency by almost 15.9% compared with other state-of-the-art algorithms.