Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, 2007
DOI: 10.1145/1362622.1362678

Exploring event correlation for failure prediction in coalitions of clusters

Abstract: In large-scale networked computing systems, component failures become norms instead of exceptions. Failure prediction is a crucial technique for self-managing resource burdens. Failure events in coalition systems exhibit strong correlations in time and space domain. In this paper, we develop a spherical covariance model with an adjustable timescale parameter to quantify the temporal correlation and a stochastic model to describe spatial correlation. We further utilize the information of application allocation …
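
The abstract names a spherical covariance model with an adjustable timescale parameter but does not give its formula. As a rough illustration only, the sketch below uses the textbook spherical covariance function from geostatistics, which may differ from the paper's exact formulation. The function name `spherical_covariance` and the parameters `lag`, `timescale`, and `sill` are assumptions made for this example, not names from the paper.

```python
def spherical_covariance(lag, timescale, sill=1.0):
    """Textbook spherical covariance (an assumption, not the paper's exact model).

    Returns `sill` at lag 0, decays smoothly, and is exactly 0 once the
    lag between two events reaches or exceeds `timescale`.
    """
    if timescale <= 0:
        raise ValueError("timescale must be positive")
    h = abs(lag) / timescale  # lag normalized by the adjustable timescale
    if h >= 1.0:
        return 0.0  # events farther apart than the timescale are uncorrelated
    return sill * (1.0 - 1.5 * h + 0.5 * h ** 3)


if __name__ == "__main__":
    # Hypothetical lags, in arbitrary time units, between two failure events.
    for lag in (0, 5, 10, 20):
        print(lag, spherical_covariance(lag, timescale=10))
```

Under this form, a larger timescale lets failures that are further apart in time still count as correlated, while a smaller one restricts the correlation to events that occur close together.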

Cited by 149 publications (119 citation statements)
References 32 publications

“…At the same time, it has also been found [36] that software-related error conditions can accumulate over time, leading to system failing in the long run. Fu and Xu [14,13] analyzed and modeled the failure correlation in an institution-wide computational grid system and a large-scale high-performance cluster system. They also proposed failure-aware resource management mechanisms by using reconfigurable distributed virtual machines in networked computer environments [10,44,8,9].…”
Section: Related Work (mentioning)
confidence: 99%
“…Based on the 2-items rules, the k-items rules are generated in step (4). Different from the Apriori-S algorithm which get the support of the candidate rules by scanning the whole log history (Shown in Fig.…”
Section: Apriori-so Event Correlations Mining Algorithm (mentioning)
confidence: 99%
“…For example, the work published in the year of 1992 [19] has concluded the impact of correlated failures on dependability is significant. Previous research efforts [3] [4] [5] [6] have shown the correlation analysis of system logs is promising in event prediction and fault diagnosis, and thus it is helpful to resource allocation, job scheduling and system management [3] [5] [9]. Recent work shows the analysis of logs is useful in mining dependency of distributed system [16] [17] or model IT service availability [18].…”
Section: Introduction (mentioning)
confidence: 99%
“…Recent accomplishments include a failure prediction framework [4], which explores correlations and forecasts the time-between-failure of future instances. Results show the system achieves more than 76% accuracy in offline prediction and more than 70% accuracy in online prediction using failure logs from Los Alamos National Laboratory.…”
Section: System Log and Reliability Analysis (mentioning)
confidence: 99%
“…Recent efforts in proactive FT primarily targeted two aspects, failure prediction [4,9] and process- or virtual-machine-level migration [13,6]. Other initial work focused on a proactive FT framework [12], which combines both to perform prediction triggered migration.…”
Section: Introduction (mentioning)
confidence: 99%