Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis 2011
DOI: 10.1145/2063384.2063444

Modeling and tolerating heterogeneous failures in large parallel systems

citations
Cited by 66 publications
(48 citation statements)
references
References 22 publications
“…From our previous research we observed that for example memory and processor cache failures usually result in a single faulty component generating hundreds or thousands of messages in less than a day [19].…”
Section: Preprocessing (mentioning)
confidence: 99%
“…The method has little effect in the sequence extraction phase since we are looking at frequent patterns, and the chance of two identical events that happen at two different locations and have different root causes being frequent in a historic log is close to zero [19,20,23]. However, the improvement comes in the prediction process, when those two events could lead to two different effects and both need to be monitored.…”
Section: Preprocessing (mentioning)
confidence: 99%
“…Others have proposed rejuvenation techniques to minimize software failures [17], [18]. In addition, failure prediction has been studied [8], [19] and has shown promising results that could be leveraged for online system monitoring. While those studies are fundamental for achieving high fault tolerance at scale, they all consider a rather stable MTBF and they do not take failure bursts into consideration.…”
Section: Related Work (mentioning)
confidence: 99%
“…Event analysis and classification in large-scale systems have been the subject of many studies, and research has shown that they can lead to good prediction results [8], [9]. One of the major challenges in this endeavor is the impact of false positives.…”
Section: B Introspective Systems (mentioning)
confidence: 99%
“…The key parameter is the MTBF µ = 1/λ. Weibull distributions are a good example of probability distributions that account for infant mortality, and they are widely used to model failures on computer platforms [42,67,54,39,43]. The definition of Weibull(λ), the Weibull distribution law of shape parameter k and scale parameter λ, goes as follows:…”
Section: Resilience At Scale (mentioning)
confidence: 99%