Failure Analysis of Jobs in Compute Clouds: A Google Cluster Case Study

Chen, Xin; Lü, Chao; Pattabiraman, Karthik

doi:10.1109/issre.2014.34

Cited by 79 publications

(47 citation statements)

References 24 publications

“…Chen et al [4] presented a study about failures in Cloud environment, using measured data from Google cluster [20]. These studies show the increasing failure rates in HPC clusters and Cloud clusters.…”

Section: Related Work (mentioning)

confidence: 99%

Machine learning based job status prediction in scientific clusters

Yoo

Sim

2016

2016 SAI Computing Conference (SAI)

Abstract: Large high-performance computing systems are built with an increasing number of components, with more CPU cores, more memory, and more storage space. At the same time, scientific applications have been growing in complexity. Together, they are leading to more frequent unsuccessful job statuses on HPC systems. From measured job statuses, 23.4% of CPU time was spent on unsuccessful jobs. We set out to study whether these unsuccessful job statuses could be anticipated from known job characteristics. To explore this possibility, we have developed a job status prediction method for the execution of jobs on scientific clusters. The Random Forests algorithm was applied to extract and characterize the patterns of unsuccessful job statuses. Experimental results show that our method can predict the unsuccessful job statuses from the monitored ongoing job executions in 99.8% of the cases, with 83.6% recall and 94.8% precision. This prediction accuracy is high enough to be used in mitigation procedures for predicted failures.

“…Cloud systems experience frequent failures due to their large-scale and distributed nature [16]. Failures of any components in the cloud may cause the jobs to be interrupted.…”

Section: Related Work (mentioning)

confidence: 99%

“…Failures of any components in the cloud may cause the jobs to be interrupted. Jobs may span thousands of cloud components and run for a long time before being interrupted, which leads to the wastage of energy and other resources [16]. Thus, one of the main challenges in cloud systems is to assure the reliability of job execution in the presence of failures.…”

Section: Related Work (mentioning)

confidence: 99%

Virtual Machine Replication on Achieving Energy-Efficiency in a Cloud

Mondal

Muppala

Machida

2016

Electronics

Abstract: The rapid growth in cloud service demand has led to the establishment of large-scale virtualized data centers in which virtual machines (VMs) are used to handle user requests for service. A user's request cannot be completed if the VM fails. Replication mechanisms can be used to mitigate the impact of failures. Further, data centers consume a large amount of energy, resulting in high operating costs and contributing to significant greenhouse gas (GHG) emissions. In this paper, we focus on Infrastructure as a Service (IaaS) clouds, where user job requests are processed by VMs, and analyze the effectiveness of VM replication in terms of job completion time as well as energy consumption. Three different schemes (cold, warm, and hot replication) are considered. The trade-offs between job completion time and energy consumption in the different replication schemes are characterized through comprehensive analytical models that capture VM state transitions and associated power consumption patterns. The effectiveness of the replication schemes is demonstrated through experimental results. To verify the validity of the proposed analytical models, we extend the widely used cloud simulator CloudSim and compare the simulation results with the analytical solutions.

“…Cloud applications may span thousands of nodes and run for a long time before being aborted, which leads to the wastage of energy and other resources. [3,4,5] In order to minimize failed execution and thus the multiple re-executions of the same workflow, fault tolerance techniques must be investigated and supported. Since the number of failures is high and their types vary, general methods can hardly exist.…”

Section: Relationship To Cloud-based Solutions (mentioning)

confidence: 99%

Usability of Scientific Workflow in Dynamically Changing Environment

Bánáti

Kail

Kacsuk

et al.

2015

IFIP Advances in Information and Communication Technology

Abstract: Scientific workflow management systems are mainly data-flow oriented and face several challenges due to the huge amount of data and the required computational capacity, which cannot be predicted before enactment. Other problems may arise due to dynamic access to data storage or other data sources and the distributed nature of the scientific workflow computational infrastructures (cloud, cluster, grid, HPC), whose status may change even while a single workflow instance is running. Many of these failures could be avoided with workflow management systems that provide provenance-based dynamism and adaptivity to the unforeseen scenarios arising during enactment. In our work we summarize and categorize the failures that can arise in a cloud environment during enactment and show the possibility of predicting and avoiding failures with dynamic and provenance support.
