Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis 2017
DOI: 10.1145/3126908.3126937
Failures in large scale systems

Abstract: Resilience is one of the key challenges in maintaining high efficiency of future extreme scale supercomputers. Researchers and system practitioners rely on field-data studies to understand reliability characteristics and plan for future HPC systems. In this work, we compare and contrast the reliability characteristics of multiple large-scale HPC production systems. Our study covers more than one billion compute node hours across five different systems over a period of 8 years. We confirm previous findings which contin…

Cited by 105 publications (17 citation statements); references 39 publications.
“…They present several conclusions, for example that some nodes experience significantly more failures than others (even if hardware is identical) and once a node fails, it is likely to experience follow-up failures. Gupta et al [85] perform a large in-depth study using data from more than one billion compute hours across five different supercomputers over a period of 8 years. They present many findings, including that failures show temporal recurrence, failures show spatial locality, and reliability of HPC systems has barely changed over generations.…”
Section: Fault Analysis
confidence: 99%
“…Our study takes into account GPU errors too, but we did not analyze these failures deeply. In [25], [7], [26], the authors also used Titan failure data. Not only did they analyze GPU failures, but they also analyzed failure events related to processor, memory, and system-user software.…”
Section: Related Work
confidence: 99%
“…In fact, one may expect that a hardware failure will occur in exascale systems every 30 to 60 min (Cappello et al., 2014; Dongarra et al., 2015; Snir et al., 2014). High Performance Computing (HPC) systems can fail due to core hangs, kernel panics, file system errors, file server failures, corrupted memories or interconnects, network outages, air conditioning failures, or power halts (Gupta et al., 2017; Lu, 2013). Common metrics to characterize the resilience of hardware are the Mean Time Between Failures (MTBF) for repairable components and the Mean Time To Failure (MTTF) for non-repairable components.…”
Section: Introduction
confidence: 99%
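The MTBF and MTTF metrics named in the statement above can be sketched in a few lines. This is a minimal illustration of the standard definitions, not code from the cited papers; the function names and the sample numbers are hypothetical.

```python
def mtbf(failure_times, total_operating_hours):
    """Mean Time Between Failures for a repairable component:
    total operating time divided by the number of observed failures."""
    if not failure_times:
        return float("inf")  # no failures observed in the window
    return total_operating_hours / len(failure_times)

def mttf(lifetimes):
    """Mean Time To Failure for non-repairable components:
    average lifetime over the observed units."""
    return sum(lifetimes) / len(lifetimes)

# Illustrative numbers: 4 failure events over one year (8760 h) of operation,
# and three non-repairable units with lifetimes given in hours.
print(mtbf([120.0, 2100.0, 4300.0, 8000.0], 8760.0))  # 2190.0 hours
print(mttf([5000.0, 7000.0, 9000.0]))                 # 7000.0 hours
```

The distinction matters for field-data studies: MTBF counts repair-and-return events on the same hardware, while MTTF averages lifetimes of units that are replaced rather than repaired.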
“…As the number of cores that can be used by scientific software increases, the MTTF decreases rapidly. Gupta et al (2017) report the MTTF of four systems in the petaflops range containing up to 18 688 nodes (see Table 1). Currently, most compute jobs only use a fraction of these petascale systems.…”
Section: Introduction
confidence: 99%
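The observation above, that MTTF decreases rapidly as more nodes are used, follows from a standard reliability assumption: if node failures are independent and exponentially distributed, the system-level MTTF is the per-node MTTF divided by the node count. A minimal sketch; the per-node MTTF value below is illustrative and not taken from the cited table:

```python
def system_mttf(node_mttf_hours, num_nodes):
    """Under independent, exponentially distributed node failures,
    the time to first failure anywhere in the system is exponential
    with rate num_nodes / node_mttf, so the mean scales as 1/N."""
    return node_mttf_hours / num_nodes

# Illustrative: an assumed per-node MTTF of ~25 years (219000 hours)
# on an 18688-node machine (Titan's node count, per the statement above)
# gives a whole-system mean time to failure of:
print(system_mttf(219000.0, 18688))  # 11.71875 hours
```

This 1/N scaling is why jobs that use only a fraction of a petascale system see failures far less often than full-machine runs, and why resilience becomes critical at exascale.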