Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis 2017
DOI: 10.1145/3126908.3126937
Failures in large scale systems

Abstract: Resilience is one of the key challenges in maintaining high efficiency of future extreme scale supercomputers. Researchers and system practitioners rely on field-data studies to understand reliability characteristics and plan for future HPC systems. In this work, we compare and contrast the reliability characteristics of multiple large-scale HPC production systems. Our study covers more than one billion compute node hours across five different systems over a period of 8 years. We confirm previous findings which contin…

Cited by 105 publications (17 citation statements); references 39 publications.
“…They present several conclusions, for example that some nodes experience significantly more failures than others (even if hardware is identical) and once a node fails, it is likely to experience follow-up failures. Gupta et al [85] perform a large in-depth study using data from more than one billion compute hours across five different supercomputers over a period of 8 years. They present many findings, including that failures show temporal recurrence, failures show spatial locality, and reliability of HPC systems has barely changed over generations.…”
Section: Fault Analysis
confidence: 99%
“…Our study takes into account GPU errors too, but we did not analyze these failures deeply. In [25], [7], [26], the authors also used Titan failure data. Not only did they analyze GPU failures, but they also analyzed failure events related to processor, memory, and system-user software.…”
Section: Related Work
confidence: 99%
“…In fact, one may expect that a hardware failure will occur in exascale systems every 30 to 60 min (Cappello et al., 2014; Dongarra et al., 2015; Snir et al., 2014). High Performance Computing (HPC) systems can fail due to core hangs, kernel panics, file system errors, file server failures, corrupted memories or interconnects, network outages, air conditioning failures, or power halts (Gupta et al., 2017; Lu, 2013). Common metrics to characterize the resilience of hardware are the Mean Time Between Failures (MTBF) for repairable components and the Mean Time To Failure (MTTF) for non-repairable components.…”
Section: Introduction
confidence: 99%
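The MTBF and MTTF metrics named in the statement above can be sketched in a few lines. This is a minimal illustration of the standard definitions, not code from the cited papers; the function names and the sample numbers are hypothetical.

```python
def mtbf(failure_times, total_operating_hours):
    """Mean Time Between Failures for a repairable component:
    total operating time divided by the number of observed failures."""
    if not failure_times:
        return float("inf")  # no failures observed in the window
    return total_operating_hours / len(failure_times)

def mttf(lifetimes):
    """Mean Time To Failure for non-repairable components:
    average lifetime over the observed units."""
    return sum(lifetimes) / len(lifetimes)

# Illustrative numbers: 4 failure events over one year (8760 h) of operation,
# and three non-repairable units with lifetimes given in hours.
print(mtbf([120.0, 2100.0, 4300.0, 8000.0], 8760.0))  # 2190.0 hours
print(mttf([5000.0, 7000.0, 9000.0]))                 # 7000.0 hours
```

The distinction matters for field-data studies: MTBF counts repair-and-return events on the same hardware, while MTTF averages lifetimes of units that are replaced rather than repaired.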
“…As the number of cores that can be used by scientific software increases, the MTTF decreases rapidly. Gupta et al (2017) report the MTTF of four systems in the petaflops range containing up to 18 688 nodes (see Table 1). Currently, most compute jobs only use a fraction of these petascale systems.…”
Section: Introduction
confidence: 99%
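The observation above, that MTTF decreases rapidly as more nodes are used, follows from a standard reliability assumption: if node failures are independent and exponentially distributed, the system-level MTTF is the per-node MTTF divided by the node count. A minimal sketch; the per-node MTTF value below is illustrative and not taken from the cited table:

```python
def system_mttf(node_mttf_hours, num_nodes):
    """Under independent, exponentially distributed node failures,
    the time to first failure anywhere in the system is exponential
    with rate num_nodes / node_mttf, so the mean scales as 1/N."""
    return node_mttf_hours / num_nodes

# Illustrative: an assumed per-node MTTF of ~25 years (219000 hours)
# on an 18688-node machine (Titan's node count, per the statement above)
# gives a whole-system mean time to failure of:
print(system_mttf(219000.0, 18688))  # 11.71875 hours
```

This 1/N scaling is why jobs that use only a fraction of a petascale system see failures far less often than full-machine runs, and why resilience becomes critical at exascale.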