2021
DOI: 10.1007/978-3-030-90539-2_25

An Explainable Model for Fault Detection in HPC Systems

Abstract: Large supercomputers are composed of numerous components that risk to break down or behave in unwanted manners. Identifying broken components is a daunting task for system administrators. Hence an automated tool would be a boon for the systems resiliency. The wealth of data available in a supercomputer can be used for this task. In this work we propose an approach to take advantage of holistic data centre monitoring, system administrator node status labeling and an explainable model for fault detection in supe…

Cited by 3 publications (4 citation statements)
References 19 publications

“…The proposed model architecture achieves the highest AUC of 0.77 compared to 0.75, which is the highest AUC achieved by the dense autoencoders (on our dataset). Another contribution of this paper is that the proposed method -unlike the previous work [5,16,17,8] achieves the best results in an unsupervised training case. Unsupervised training is instrumental as it offers a possibility of deploying an anomaly detection model to the cases where (accurately) labelled dataset is unavailable.…”
Section: Discussion (citation type: mentioning)
confidence: 92%
“…All mentioned approaches use synthetic anomalies injected into the HPC system to train a supervised classification model. Approaches [5] and [16] are among the few that leverage real anomalies collected from production HPC systems (as opposed to injected anomalies). In this paper, we are interested in real anomalies, and thus, we will not include methods using synthetic/simulated data or injected anomalies in our quantitative comparisons.…”
Section: Related Work (citation type: mentioning)
confidence: 99%
“…In recent years, owing to the increasing demand for high-performance computing (HPC) as well as the scale-up supercomputers and intelligent computing systems, the reliability of large-scale computing systems has been investigated extensively [1][2][3][4]. The system operation is complex, and failures occur frequently which are difficult to detect, locate, diagnose, analyze, and debug [1,5,6].…”
Section: Introduction (citation type: mentioning)
confidence: 99%