2019 IEEE 26th International Conference on High Performance Computing, Data, and Analytics (HiPC) 2019
DOI: 10.1109/hipc.2019.00047

Reducing False Node Failure Predictions in HPC

Cited by 9 publications
(10 citation statements)
References 32 publications
“…Second, previous studies evaluate the proposed methods using classical prediction metrics, such as precision, recall and F1-score. Although these classical metrics are often suitable, various studies [12]-[15] and our results show that they are insufficient for evaluation of HPC failure predictors because they cannot be used to determine whether prediction is useful in practice.…”
Section: Introduction
Citation type: mentioning
confidence: 80%
“…Although these classical metrics are often suitable, various studies [12]-[15] show that they are insufficient for evaluation of HPC failure predictors. This is because, as shown in Section V-D, they are not correlated with a cost-benefit analysis, and therefore cannot be used to decide whether and for which model parameters the prediction is useful in practice.…”
Section: B Precision Recall and F1-score
Citation type: mentioning
confidence: 99%
“…Failure detection and prediction at the component level are widely researched in such fields as cloud, grid, and high-performance computing (HPC) using various techniques, such as artificial intelligence (AI), machine learning (ML), and rule-based and probabilistic models. [32,60,61] The common attributes of such environments are that they often consist of many server nodes and manage and process critical data. [62] Thus, the failure of the SPOF components may have a severe cost, such as the loss of revenue when a corresponding application is unavailable.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…The results showed that the Random Forest algorithm achieved the best accuracy. Frank et al. [67] tried to identify failed nodes that are being used by running large-scale applications on the HPC system. The authors proposed a new feature-based system for node failure predictors using machine learning with a low percentage of false alarms at large scales.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…References | Features | References
Skewness | [54], [55], [56], [58], [61], [62], [63], [64], [65], [66] | Count above mean | [60]
Kurtosis | [54], [55], [56], [58], [61], [62], [63], [64], [65], [66] | Count below mean | [60]
Mean | [56], [58], [59], [60], [62], [64], [66], [67], [68], | Historical change | [60]
Autocorrelation or Serial correlation | [54], [55], [59], [61], [62], [63], [65] | Simple moving average | [60]
Standard deviation | [55], [56], [58], [62], [63], [64], [67] | Weighted moving average | [60]
C3 (nonlinearity) | [54], [55], [61], …”
Section: Features
Citation type: mentioning
confidence: 99%