2010
DOI: 10.1109/tdsc.2009.4

A Large-Scale Study of Failures in High-Performance Computing Systems

Cited by 492 publications (232 citation statements)
References 16 publications
“…The failure number actually grows over a period of nearly 18 months, before it eventually starts dropping. The reason most likely is that many problems in hardware, software and configuration are only exposed by real user code in the production workloads [3].…”
Section: System Failures (mentioning)
confidence: 99%
“…For each physical node, the MTBF is programmed according to a Weibull distribution, with a shape parameter of 0.8, which has been shown [37] to well approximate the time between failures for individual nodes, as well as for the entire system. Failed nodes stay unavailable (i.e., mean time to repair (MTTR)) during a period modelled by a lognormal distribution, with a mean time set to 20 min, varying up to 150 min.…”
Section: Failures and Unavailability Properties (mentioning)
confidence: 99%
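
The failure model described in the statement above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the citing authors' simulator. The shape parameter (0.8), the 20 min mean repair time and the 150 min upper bound come from the quoted text. The target MTBF of 1000 hours, the lognormal sigma of 1.0, the choice to clip repair times at 150 min, and all function names are assumptions made here for the example.

```python
import math
import random

WEIBULL_SHAPE = 0.8             # from the quoted statement
MTTR_MEAN_MIN = 20.0            # from the quoted statement
MTTR_MAX_MIN = 150.0            # from the quoted statement ("varying up to 150 min")
ASSUMED_MTBF_HOURS = 1000.0     # assumption: the quote gives no per-node MTBF
ASSUMED_LOGNORMAL_SIGMA = 1.0   # assumption: the quote gives no spread parameter


def weibull_scale_for_mean(mean, shape):
    """Scale such that a Weibull(scale, shape) variable has the given mean."""
    return mean / math.gamma(1.0 + 1.0 / shape)


def lognormal_mu_for_mean(mean, sigma):
    """mu such that an (unclipped) lognormal(mu, sigma) has the given mean."""
    return math.log(mean) - sigma ** 2 / 2.0


def sample_time_to_failure_hours(rng):
    scale = weibull_scale_for_mean(ASSUMED_MTBF_HOURS, WEIBULL_SHAPE)
    # random.weibullvariate takes (scale, shape) in that order.
    return rng.weibullvariate(scale, WEIBULL_SHAPE)


def sample_repair_minutes(rng):
    mu = lognormal_mu_for_mean(MTTR_MEAN_MIN, ASSUMED_LOGNORMAL_SIGMA)
    # Assumption: clip at 150 min; clipping pulls the mean slightly below 20.
    return min(rng.lognormvariate(mu, ASSUMED_LOGNORMAL_SIGMA), MTTR_MAX_MIN)


def simulate_node(rng, horizon_hours):
    """Return (failure_time_hours, repair_minutes) pairs up to the horizon."""
    events = []
    t = 0.0
    while True:
        t += sample_time_to_failure_hours(rng)
        if t >= horizon_hours:
            return events
        repair = sample_repair_minutes(rng)
        events.append((t, repair))
        t += repair / 60.0  # the node is unavailable while being repaired


if __name__ == "__main__":
    rng = random.Random(42)
    for failure_time, repair in simulate_node(rng, horizon_hours=10_000.0):
        print(f"failure at {failure_time:9.1f} h, repaired after {repair:6.1f} min")
```

A shape parameter below 1 gives a decreasing hazard rate: the longer a node has run since its last failure, the less likely it is to fail in the next instant. Short gaps between failures are therefore more common than an exponential model with the same mean would predict.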
“…Noticeable progress has been made on failure prediction research and practice, following that more failure traces are made public available since 2006 and the failure analysis [5][6][7][8][9][10] [11] develop a resource failure prediction model for fine-grained cycle sharing systems. However, most of them focus on improving the predication accuracy, and few of them address how to leverage their predication results in practice.…”
Section: Related Work (mentioning)
confidence: 99%
“…Studies in [5][6][7][8][9] recognize the temporal and spatial correlation of failures in large scale cluster systems. In order to mimic these failure characteristics in the simulations, we choose to use the failure traces published by Los Alamos National Laboratory [23].…”
Section: Failure Traces and Prediction Accuracy (mentioning)
confidence: 99%