2019
DOI: 10.1007/978-3-030-29400-7_1

Online Fault Classification in HPC Systems Through Machine Learning

Abstract: As High-Performance Computing (HPC) systems strive towards the exascale goal, studies suggest that they will experience excessive failure rates. For this reason, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failures will be essential for continued operation. In this paper, we propose a fault classification method for HPC systems based on machine learning that has been designed specifically to operate with live streamed data. We c…

citations
Cited by 6 publications
(7 citation statements)
references
References 14 publications
“…The results are very clear: both models were able to correctly learn to distinguish between normal states and anomalous ones, as highlighted by the very high values of precision, recall, and F-score. The accuracy number are on par with those reported by the state-of-the-art [10], [41], a significant first result for real failures happening on production supercomputing nodes. This is an important and promising step towards the adoption of Nagios as an annotation method.…”
Section: B Supervised Approach Results (supporting)
confidence: 59%
“…We start by considering the results of the pure supervised approach. This is the kind of approach proposed by the current state-of-the-art for fault classification in supercomputers (e.g., see [10], [41]). However, the works in the literature consider artificially injected faults and not real production anomalies; they also focus on "reliable" labels which reflect the actual underlying change in the target systems.…”
Section: B Supervised Approach Results (mentioning)
confidence: 99%
“…However, problems may arise when dealing with sparse, biased, or time-dependent data, in which cases the naive use of machine learning can result in ill-posed problems and generate non-physical predictions (Peng et al, 2020). The existing online learning techniques implemented on HPC (Tuncer et al, 2018; Borghesi et al, 2019; Netti et al, 2019) fail to integrate underlying physics prior which constrains the space of admissible solutions. Therefore, there are still challenges for achieving honest precision across the entire scales for general physics processes, but our BIOL opens the door to a new era of real-time analysis for in silico simulations that could save significant computing time and disk space while extending the reach of physics searches and precision measurements at the biological processes and beyond.…”
Section: Related Work (mentioning)
confidence: 99%
“…As such they do not use time series of temperatures as input but aggregated metrics such as the mean and the standard deviation. The solutions provided by Tuncer et al [50] and Netti et al [51] follow the same concept as our solution: analyzing the metrics provided by the system and building a machine learning model to classify these metrics according to an error state. However, in their cases, studied errors have been created by a fault injector.…”
Section: Related Work (mentioning)
confidence: 99%