2019 31st International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)
DOI: 10.1109/sbac-pad.2019.00040

Analyzing a Five-Year Failure Record of a Leadership-Class Supercomputer

Abstract: Extreme-scale computing systems are required to solve some of the grand challenges in science and technology. From astrophysics to molecular biology, supercomputers are an essential tool to accelerate scientific discovery. However, large computing systems are prone to failures due to their complexity. It is crucial to develop an understanding of how these systems fail to design reliable supercomputing platforms for the future. This paper examines a five-year failure and workload record of a leadership-class su…

Cited by 14 publications (7 citation statements)
References 18 publications

“…The very large total number of components will lead to frequent failures, even though the mean time between failures (MTBF) for the individual components may be large. For instance, while the MTBF of a CPU can be months to years [2], that of current supercomputers can be within a few hours [3,4]. With billion-core parallelism at exascale, the MTBF has been projected to be within (or even far below) one hour [5,6].…”
Section: Introduction (mentioning)
confidence: 99%
“…Recent studies show that modern HPC systems have several failures per day [16], [32], consequently, we set MTBF in equations 5 and 6 equal to 6 hours. This choice corresponds to the estimated MTBF when the application uses 15K cores [9].…”
Section: B Performance Measurements (mentioning)
confidence: 99%
“…The very large total number of components will lead to frequent failures, even though the mean time between failures (MTBF) for the individual components may be large. For instance, while the MTBF of a CPU can be months to years [2], that of current supercomputers can be a few hours or less [3], [4]. With billion-core parallelism at exascale, the MTBF has been projected to be within (or even significantly below) one hour [5], [6].…”
Section: Introduction (mentioning)
confidence: 99%
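
The scaling argument in the citation statements above (a large MTBF per component, a small MTBF for the whole machine) can be illustrated with a minimal sketch. It assumes independent components with constant failure rates, so that the system failure rate is the sum of the component rates, and it treats any single component failure as a system failure. The function names and the example numbers are hypothetical and are not taken from the paper or from the citing works.

```python
def system_mtbf_hours(component_mtbf_hours: float, n_components: int) -> float:
    """Estimate system MTBF from the MTBF of identical components.

    Assumption (not from the source): components fail independently with a
    constant rate, and any single component failure counts as a system
    failure. The failure rates then add, so the system MTBF is the
    component MTBF divided by the number of components.
    """
    if component_mtbf_hours <= 0 or n_components <= 0:
        raise ValueError("MTBF and component count must be positive")
    return component_mtbf_hours / n_components


def failures_per_day(mtbf_hours: float) -> float:
    """Convert an MTBF in hours to an average number of failures per day."""
    if mtbf_hours <= 0:
        raise ValueError("MTBF must be positive")
    return 24.0 / mtbf_hours


if __name__ == "__main__":
    # Hypothetical inputs: a 5-year component MTBF and 10,000 components.
    component_mtbf = 5 * 365 * 24  # hours
    print(system_mtbf_hours(component_mtbf, 10_000))
    # The 6-hour MTBF quoted in one citation statement, as failures per day.
    print(failures_per_day(6.0))
```

Under these assumptions, 10,000 components with a 5-year MTBF each yield a system MTBF of about 4.38 hours, and a 6-hour MTBF corresponds to 4 failures per day, which is in line with the "several failures per day" quoted above.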