Failure Analysis of Virtual and Physical Machines: Patterns, Causes and Characteristics

Birke, Robert; Giurgiu, Ioana; Chen, Lydia Y.; Wiesmann, D.; Engbersen, Ton

doi:10.1109/dsn.2014.18

Cited by 83 publications

(26 citation statements)

References 10 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…The authors of [16] conduct a large scale analysis comparing and relating physical and virtual machine failures from commercial data centers. However, their study is limited due to an inconsistent clarity across different data sources they use.…”

Section: Related Workmentioning

confidence: 99%

“…Birke et al [16] suggest that virtual and physical machine TBFs have very similar distributions and the best fit is the gamma distribution with decreasing hazard rate, whereas the times to repair are well modelled with lognormal distribution. Instead, Viswanath et al [14] shows that time between successive failures on the same machine fits well an inverse function model.…”

Section: Related Workmentioning

confidence: 99%

See 1 more Smart Citation

Failure process characteristics of cloud-enabled services

Tola

Jiang

Helvik

2017

2017 9th International Workshop on Resilient Networks Design and Modeling (RNDM)

View full text Add to dashboard Cite

Abstract-The design of cloud computing technologies need to guarantee high levels of availability and for this reason there is a large interest in new fault tolerant techniques that are able to keep the resilience of the systems at the desired level. The modeling of these techniques require input information about the operational state of the systems that have a stochastic nature. The aim of this paper is to provide insights into the stochastic behavior of cloud services. By exploiting the willingness of service providers to publicly expose failure incident information on the web, we collected and analyzed dependability features of a large number of incident reports counting more than 10,600 incidents related to 106 services. Through the analysis of failure data information we provide some useful insights about the Poisson nature of cloud service's failure processes by fitting well known models and assessing their suitability.

show abstract

Section: Related Workmentioning

confidence: 99%

Section: Related Workmentioning

confidence: 99%

Failure process characteristics of cloud-enabled services

Tola

Jiang

Helvik

2017

2017 9th International Workshop on Resilient Networks Design and Modeling (RNDM)

View full text Add to dashboard Cite

show abstract

“…Prior work on system reliability [11] points out that there exist positive correlations between application performance and system load. To demystify the time-varying behavior of failed jobs across priorities, we also resort to the system load.…”

Section: A Dependency On System Loadmentioning

confidence: 99%

“…Consequently, for each job, we compute the load indicators for three categories of priority: (i) its priority (ii) lower priority and (iii) higher priority. For example, when a job with priority 7 arrives at the system, we compute three different values of task arrival, considering tasks of priority 7, tasks with priority among [0, 6], and tasks with priority among [8,11], respectively. A similar computation needs to be applied for the throughput, as well as for the number of tasks.…”

Section: B Featuresmentioning

confidence: 99%

Predicting and Mitigating Jobs Failures in Big Data Clusters

Rosà

Chen

Binder

2015

2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing

Self Cite

View full text Add to dashboard Cite

In large-scale datacenters, software and hardware failures are frequent, resulting in failures of job executions that may cause significant resource waste and performance deterioration. To proactively minimize the resource inefficiency due to job failures, it is important to identify them in advance using key job attributes. However, so far, prevailing research on datacenter workload characterization has overlooked job failures, including their patterns, root causes, and impact. In this paper, we aim to develop prediction models and mitigation policies for unsuccessful jobs, so as to reduce the resource waste in big datacenters. In particular, we base our analysis on Google cluster traces, consisting of a large number of big-data jobs with a high task fanout. We first identify the time-varying patterns of failed jobs and the contributing system features. Based on our characterization study, we develop an on-line predictive model for job failures by applying various statistical learning techniques, namely Linear Discriminant Analysis (LDA), Quadratic Discriminant Analysis (QDA), and Logistic Regression (LR). Furthermore, we propose a delay-based mitigation policy which, after a certain grace period, proactively terminates the execution of jobs that are predicted to fail. The particular objective of postponing job terminations is to strike a good tradeoff between resource waste and false prediction of successful jobs. Our evaluation results show that the proposed method is able to significantly reduce the resource waste by 41.9% on average, and keep false terminations of jobs low, i.e., only 1%. 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing 978-1-4799-8006-2/15 $31.00

show abstract

“…Although a large body of related work analyzes failures in large datacenters, most notably in terms of hardware [4], [11], software [7], [8], network components [5], and virtual machines [12], little work has been done [13], [14] in studying the broader class of unsuccessful executions in big-data systems. Nevertheless, deepening our knowledge in this field is of paramount importance, as unsuccessful executions can result in degradation of Quality of Service (QoS), reliability and energy waste that can ultimately lead to a high resource waste and performance impairment.…”

Section: Introductionmentioning

confidence: 99%

Understanding Unsuccessful Executions in Big-Data Systems

Rosà

Chen

Binder

2015

2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing

Self Cite

View full text Add to dashboard Cite

Big-data applications are being increasingly used in today's large-scale datacenters for a large variety of purposes, such as solving scientific problems, running enterprise services, and computing data-intensive tasks. Due to the growing scale of these systems and the complexity of running applications, jobs running in big-data systems experience unsuccessful terminations of different nature. While a large body of existing studies sheds light on failures occurred in large-scale datacenters, the current literature overlooks the characteristics and the performance impairment of a broader class of unsuccessful executions which can arise due to application failures, dependency violations, machine constraints, job kills, and task preemption. Nonetheless, deepening our understanding in this field is of paramount importance, as unsuccessful executions can lower user satisfaction, impair reliability, and lead to a high resource waste. In this paper, we describe the problem of unsuccessful executions in big-data systems, and highlight the critical importance of improving our knowledge on this subject. We review the existing literature on this field, discuss its limitations, and present our own contributions to the problem, along with our research plan for the future. 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing 978-1-4799-8006-2/15 $31.00

show abstract

Failure Analysis of Virtual and Physical Machines: Patterns, Causes and Characteristics

Cited by 83 publications

References 10 publications

Failure process characteristics of cloud-enabled services

Failure process characteristics of cloud-enabled services

Predicting and Mitigating Jobs Failures in Big Data Clusters

Understanding Unsuccessful Executions in Big-Data Systems

Contact Info

Product

Resources

About