2008 IEEE International Conference on Electro/Information Technology 2008
DOI: 10.1109/eit.2008.4554349

Module Prototype for Online Failure Prediction for the IBM Blue Gene/L

Abstract: The growing complexity of scientific applications has led to the design and deployment of large-scale parallel systems. The IBM Blue Gene/L can hold in excess of 200K processors and it has been designed for high performance and reliability. However, failures in this large-scale parallel system are a major concern, since it has been demonstrated that a failure will significantly reduce the performance of the system. Although reactive fault tolerant policies effectively minimize the effects of faults, it has been…

Cited by 7 publications
(5 citation statements)
References 12 publications
“…Event filtering is described in [18], and fault/failure prediction and analysis in [17] and [19]. [29] carries through to simulation-based experimental evaluation of predictions. [8] builds very thorough models of reliability for BlueGene systems and also gives a very careful review of recent work on reliability models and analyses.…”
Section: Related Work (mentioning)
confidence: 99%
“…However, there are several works in Network Fault Prediction, among which [18], [36], [61]-[65], which treat the NFP problem without using ML methods and can show interesting results. One example is the case of Hood et al [61], where they show that it may be possible to predict network failure several hops away in space and around 10 minutes away in time (further testing and specification needed).…”
Section: ML Methods For Network Fault Prediction (mentioning)
confidence: 99%
“…For example, IBM cite a failure rate of 0.02 faults/month/TF on a BlueGene machine, which scales to around 1 fault/minute on an ExaFLOP system [21].…”
Section: Fault Tolerance (mentioning)
confidence: 99%
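
The scaling quoted in the last statement can be checked with a short calculation. The sketch below is only an illustration under stated assumptions: the fault rate scales linearly with compute capacity, a month is taken as 30 days, and the function name `faults_per_minute` is hypothetical rather than taken from any cited work.

```python
# Assumption: 1 ExaFLOP/s = 10^6 TeraFLOP/s, and faults scale linearly with capacity.
TFLOPS_PER_EXAFLOP = 1_000_000
# Assumption: a 30-day month.
MINUTES_PER_MONTH = 30 * 24 * 60


def faults_per_minute(rate_per_month_per_tf: float, system_tf: float) -> float:
    """Convert a faults/month/TF rate into faults/minute for a system of the given size."""
    return rate_per_month_per_tf * system_tf / MINUTES_PER_MONTH


# Quoted rate: 0.02 faults/month/TF, applied to a 1 ExaFLOP system.
rate = faults_per_minute(0.02, TFLOPS_PER_EXAFLOP)
print(f"{rate:.2f} faults/minute")
```

Under these assumptions the result is roughly one fault every two minutes, which is the same order of magnitude as the quoted "around 1 fault/minute".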