Extreme-scale computing systems are required to solve some of the grand challenges in science and technology. From astrophysics to molecular biology, supercomputers are an essential tool to accelerate scientific discovery. However, large computing systems are prone to failures due to their complexity. To design reliable supercomputing platforms for the future, it is crucial to understand how these systems fail. This paper examines a five-year failure and workload record of a leadership-class supercomputer. To the best of our knowledge, five years represents the vast majority of the lifespan of a supercomputer. This is the first time such an analysis has been performed on a top-10 modern supercomputer. We performed a failure categorization and found that: i) most errors are GPU-related, with roughly 37% of them being double-bit errors on the cards; ii) failures are not evenly spread across the physical machine, with room temperature presumably playing a major role; and iii) software errors of the system bring down several nodes concurrently. Our failure rate analysis reveals that: i) the system consistently degrades, being at least twice as reliable at the beginning of the period as at the end; ii) a Weibull distribution closely fits the mean-time-between-failures data; and iii) hardware and software errors show markedly different patterns. Finally, we correlated failure and workload records to reveal that: i) failure and workload records are weakly correlated, except for certain types of failures when segmented by hour of the day; ii) several categories of failures make jobs crash within the first minutes of execution; and iii) a significant fraction of failed jobs exhaust the requested time regardless of when the failure occurred during execution.
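As an illustration of what fitting a Weibull distribution to time-between-failure data can involve, here is a minimal sketch of a maximum-likelihood fit in plain Python. It is not the authors' procedure: the function name `fit_weibull`, the bisection bounds, and the synthetic data in the demo are all assumptions made for this example.

```python
import math
import random


def fit_weibull(samples, iterations=100):
    """Maximum-likelihood Weibull fit. Returns (shape, scale).

    Hypothetical helper for illustration; assumes all samples are > 0
    and that the shape parameter lies between 0.001 and 50.
    """
    if len(samples) < 2 or min(samples) <= 0:
        raise ValueError("need at least two strictly positive samples")

    # Rescale by the largest sample so x ** k can never overflow.
    # The shape equation below is unaffected by this rescaling.
    top = max(samples)
    xs = [s / top for s in samples]
    logs = [math.log(x) for x in xs]
    mean_log = sum(logs) / len(xs)

    def score(k):
        # Derivative of the profile log-likelihood with respect to
        # the shape k (up to a positive factor); it increases with k.
        powers = [x ** k for x in xs]
        weighted = sum(p * l for p, l in zip(powers, logs))
        return weighted / sum(powers) - 1.0 / k - mean_log

    # Bisection on the monotone score function.
    lo, hi = 1e-3, 50.0
    for _ in range(iterations):
        mid = (lo + hi) / 2.0
        if score(mid) < 0:
            lo = mid
        else:
            hi = mid
    shape = (lo + hi) / 2.0

    # Closed-form scale estimate given the shape, mapped back to the
    # original units.
    scale = top * (sum(x ** shape for x in xs) / len(xs)) ** (1.0 / shape)
    return shape, scale


if __name__ == "__main__":
    # Synthetic gaps between failures (NOT data from the paper).
    rng = random.Random(0)
    gaps = [rng.weibullvariate(10.0, 0.7) for _ in range(5000)]
    shape, scale = fit_weibull(gaps)
    print(f"shape={shape:.3f} scale={scale:.3f}")
```

With real records, `samples` would be the elapsed times between consecutive failures, and a goodness-of-fit check would still be needed before concluding that the distribution fits closely.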
Granulomatous amoebic encephalitis caused by free‐living amoebae is a rare condition that is difficult to diagnose and hard to treat, and it is generally fatal. Anti‐amoebic treatment is often delayed because the clinical signs and symptoms may mask the probable causative agent, misdirecting the choice of diagnostic test. Four genera of free‐living amoebae are associated with human infection: Naegleria, Acanthamoeba sp., Balamuthia and Sappinia. Two boys were admitted with a diagnosis of acute encephalitis. A history of contact with swimming pools and rivers supported the suspicion of an infection due to free‐living amoebae. In both cases a brain biopsy was performed, and histology confirmed granulomatous amoebic encephalitis with the presence of amoebic trophozoites.
Deep learning (DL) applications are increasingly being deployed on HPC systems to leverage the massive parallelism and computing power of those systems for DL model training. While DL frameworks have put significant effort into facilitating distributed training, fault tolerance has been largely ignored. In this work, we evaluate checkpoint-restart, a common fault tolerance technique in HPC workloads. We perform experiments with three state-of-the-art DL frameworks common in HPC (Chainer, PyTorch, and TensorFlow). We evaluate the computational cost of checkpointing, file formats and file sizes, the impact of scale, and deterministic checkpointing. Our evaluation shows some critical differences in checkpoint mechanisms and exposes several bottlenecks in existing checkpointing implementations. We provide discussion points that can aid users in selecting a fault-tolerant framework to use in HPC. We also provide takeaway points that framework developers can use to facilitate better checkpointing of DL workloads in HPC.
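The checkpoint-restart idea itself can be sketched without any DL framework. The toy loop below is an assumption-laden illustration, not the mechanism of Chainer, PyTorch, or TensorFlow: the names `save_checkpoint`, `load_checkpoint`, and `train`, the pickle file format, and the stand-in "weight" update are all hypothetical. It shows periodic saving, an atomic file replace so a crash cannot leave a half-written checkpoint, and deterministic resumption.

```python
import os
import pickle


def save_checkpoint(path, state):
    """Write the state to a temp file, then atomically replace the old one."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # make sure the bytes reach the disk
    os.replace(tmp, path)     # atomic rename on the same filesystem


def load_checkpoint(path):
    """Return the saved state, or None if no checkpoint exists yet."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)


def train(path, total_steps, every, fail_at=None):
    """Toy training loop that resumes from the last checkpoint.

    `fail_at` simulates a node failure at a given step (hypothetical
    parameter used only to demonstrate the restart).
    """
    state = load_checkpoint(path) or {"step": 0, "weight": 0.0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        # Stand-in for one optimizer step.
        state["weight"] += 0.5 * (1.0 - state["weight"])
        state["step"] += 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    return state
```

If a run with `every=5` fails at step 7, calling `train` again with the same path reloads the step-5 checkpoint and repeats only steps 6 and 7. Because the update is deterministic, the resumed run ends in the same state as an uninterrupted one.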
Deep learning models have become a valuable tool for solving complex problems in many critical areas. It is important to ensure the reliability of these models' outputs, even if faults occur during execution. In this article we present a reliability evaluation of three deep learning models. We use an ImageNet dataset and develop a fault injector to run the tests. The results show that the models differ in their sensitivity to faults. Moreover, some models keep error values low despite an increase in the fault rate.
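The abstract does not describe how its fault injector works. One common approach, shown here purely as an assumed illustration, is to flip single bits in the 32-bit floating-point representation of model weights. The names `flip_bit` and `inject_faults` and the per-weight fault `rate` are hypothetical.

```python
import random
import struct


def flip_bit(value, bit):
    """Flip one bit (0 = least significant, 31 = sign) of a float32 value."""
    if not 0 <= bit < 32:
        raise ValueError("bit must be in the range 0..31")
    # Reinterpret the float32 bytes as an unsigned 32-bit integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    # Flip the chosen bit and reinterpret the result as a float32.
    (flipped,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))
    return flipped


def inject_faults(weights, rate, rng):
    """Return a copy of `weights` where each entry has one random bit
    flipped with probability `rate` (an assumed fault model)."""
    result = []
    for w in weights:
        if rng.random() < rate:
            w = flip_bit(w, rng.randrange(32))
        result.append(w)
    return result
```

Which bit is hit matters a great deal: flipping the sign bit of 1.0 yields -1.0, while flipping the top exponent bit yields infinity. That spread is one reason sensitivity to faults can differ between models.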