Online Fault Classification in HPC Systems Through Machine Learning

Netti, Alessio; Kızıltan, Zeynep; Babaoğlu, Özalp; Sîrbu, Alina; Bartolini, Andrea; Borghesi, Andrea

doi:10.1007/978-3-030-29400-7_1

Cited by 6 publications

(7 citation statements)

References 14 publications


“…The results are very clear: both models were able to correctly learn to distinguish between normal states and anomalous ones, as highlighted by the very high values of precision, recall, and F-score. The accuracy numbers are on par with those reported by the state-of-the-art [10], [41], a significant first result for real failures happening on production supercomputing nodes. This is an important and promising step towards the adoption of Nagios as an annotation method.…”

Section: B Supervised Approach Results (supporting)

confidence: 59%

“…We start by considering the results of the pure supervised approach. This is the kind of approach proposed by the current state-of-the-art for fault classification in supercomputers (e.g., see [10], [41]). However, the works in the literature consider artificially injected faults and not real production anomalies; they also focus on "reliable" labels which reflect the actual underlying change in the target systems.…”

Section: B Supervised Approach Results (mentioning)

confidence: 99%


Anomaly Detection and Anticipation in High Performance Computing Systems

Borghesi, Molan, Milano, et al. 2022

IEEE Trans. Parallel Distrib. Syst.

Self Cite


In their quest towards Exascale, High Performance Computing (HPC) systems are rapidly becoming larger and more complex, together with the issues concerning their maintenance. Luckily, many current HPC systems are endowed with data monitoring infrastructures that characterize the system state, and whose data can be used to train Deep Learning (DL) anomaly detection models, a very popular research area. However, the lack of labels describing the state of the system is a widespread issue, as annotating data is a costly task that generally falls on human system administrators and thus does not scale toward exascale. In this work we investigate the possibility of extracting labels from a service monitoring tool (Nagios) currently used by HPC system administrators to flag the nodes which undergo maintenance operations. This allows data collected by a fine-grained monitoring infrastructure to be annotated automatically; this labelled data is then used to train and validate a DL model for anomaly detection. We conduct the experimental evaluation on a tier-0 production supercomputer hosted at CINECA, Bologna, Italy. The results reveal that the DL model can accurately detect the real failures and, moreover, can predict the onset of anomalies by systematically anticipating the actual labels (i.e. the moment when system administrators realize that an anomalous event happened); the average advance time computed on historical traces is around 45 minutes. The proposed technology can be easily scaled toward exascale systems to ease their maintenance.

“…However, problems may arise when dealing with sparse, biased, or time-dependent data, in which cases the naive use of machine learning can result in ill-posed problems and generate non-physical predictions (Peng et al, 2020). The existing online learning techniques implemented on HPC (Tuncer et al, 2018; Borghesi et al, 2019; Netti et al, 2019) fail to integrate underlying physics prior which constrains the space of admissible solutions. Therefore, there are still challenges for achieving honest precision across the entire scales for general physics processes, but our BIOL opens the door to a new era of real-time analysis for in silico simulations that could save significant computing time and disk space while extending the reach of physics searches and precision measurements at the biological processes and beyond.…”

Section: Related Work (mentioning)

confidence: 99%

Online Machine Learning for Accelerating Molecular Dynamics Modeling of Cells

Zhang, Han, et al. 2022

Front. Mol. Biosci.


We developed a biomechanics-informed online learning framework to learn the dynamics with ground truth generated with multiscale modeling simulation. It was built on Summit-like supercomputers, which were also used to benchmark and validate our framework on one physiologically significant modeling of deformable biological cells. We generalized the century-old equation of Jeffery orbits to a new equation of motion with additional parameters to account for the flow conditions and the cell deformability. Using simulation data at particle-based resolutions for flowing cells and the learned parameters from our framework, we validated the new equation by the motions, mostly rotations, of a human platelet in shear blood flow at various shear stresses and platelet deformability. Our online framework, which surrogates redundant computations in the conventional multiscale modeling by solutions of our learned equation, accelerates the conventional modeling by three orders of magnitude without visible loss of accuracy.


“…As such they do not use time series of temperatures as input but aggregated metrics such as the mean and the standard deviation. The solutions provided by Tuncer et al [50] and Netti et al [51] follow the same concept as our solution: analyzing the metrics provided by the system and building a machine learning model to classify these metrics according to an error state. However, in their cases, studied errors have been created by a fault injector.…”

Section: Related Work (mentioning)

confidence: 99%

CPU overheating prediction in HPC systems

Platini, Ropars, Pelletier, et al. 2021

Concurrency and Computation


Summary: As supercomputers grow in size, the number of abnormal events also increases. CPU overheating is one such event that decreases the system efficiency: when a CPU overheats, it reduces its frequency. This paper presents a machine learning solution to predict such events. The proposed algorithm is based on dynamic time warping for feature extraction and on a machine learning algorithm for classification. It predicts overheating events solely by analyzing the trends of the temperature of the CPUs and can deal with very low temperature sampling rates while having a negligible computational cost in practice. Our evaluation, using data coming from a production supercomputer, shows that the proposed solution can make predictions a few minutes in advance with a good accuracy. Furthermore, considering two simple preventive actions to avoid CPU overheating events, we present an analytical study that shows that our predictive solution is good enough to allow a significant reduction of the cost of overheating events.
