Reducing False Node Failure Predictions in HPC

Frank, Alvaro; Yang, Dazhong; Brinkmann, André; Schulz, Martin; Süß, Tim

doi:10.1109/hipc.2019.00047

Cited by 9 publications

(10 citation statements)

References 32 publications


“…Second, previous studies evaluate the proposed methods using classical prediction metrics, such as precision, recall and F1-score. Although these classical metrics are often suitable, various studies [12]-[15] and our results show that they are insufficient for evaluation of HPC failure predictors because they cannot be used to determine whether prediction is useful in practice.…”

Section: Introduction (mentioning)

confidence: 80%

“…Although these classical metrics are often suitable, various studies [12]-[15] show that they are insufficient for evaluation of HPC failure predictors. This is because, as shown in Section V-D, they are not correlated with a cost-benefit analysis, and therefore cannot be used to decide whether and for which model parameters the prediction is useful in practice.…”

Section: B Precision Recall and F1-score (mentioning)

confidence: 99%


Cost-Aware Prediction of Uncorrected DRAM Errors in the Field

Boixaderas, Zivanovic, Moré, et al. 2020

SC20: International Conference for High Performance Computing, Networking, Storage and Analysis


This paper presents and evaluates a method to predict DRAM uncorrected errors, a leading cause of hardware failures in large-scale HPC clusters. The method uses a random forest classifier, which was trained and evaluated using error logs from two years of production of the MareNostrum 3 supercomputer. By enabling the system to take measures to mitigate node failures, our method reduces lost compute time by up to 57%, a net saving of 21,000 node-hours per year. We release all source code as open source. We also discuss and clarify aspects of methodology that are essential for a DRAM prediction method to be useful in practice. We explain why standard evaluation metrics, such as precision and recall, are insufficient, and base the evaluation on a cost-benefit analysis. This methodology can help ensure that any DRAM error predictor is clear from training bias and has a clear cost-benefit calculation.


“…Failure detection and prediction at the component level are widely researched in such fields as cloud, grid, and high-performance computing (HPC) using various techniques, such as artificial intelligence (AI), machine learning (ML), and rule-based and probabilistic models. 32,60,61 The common attributes of such environments are that they often consist of many server nodes and manage and process critical data. 62 Thus, the failure of the SPOF components may have a severe cost, such as the loss of revenue when a corresponding application is unavailable.…”

Section: Related Work (mentioning)

confidence: 99%

Predicting locally manageable resource failures of high availability clusters

Somasekaram, Călinescu 2022

Softw Pract Exp


Critical services from domains as diverse as finance, manufacturing and healthcare are often delivered by complex enterprise applications (EAs). High-availability clusters (HACs) are software-managed IT infrastructures that enable these EAs to operate with minimum downtime. This paper presents a novel Bayesian decision network model to improve the failure detection capabilities of the HACs components using a comprehensive set of characteristics for the analysed component. The model then combines these characteristics to predict whether the failure of this component can be managed locally at the failed component level without propagating the failure to upper-level components and causing a complete system failure. By improving the detection capabilities and predicting locally manageable failures, the model improves the decision-making process of HACs, and has the potential to reduce the downtime and improve availability for the applications protected by HACs. The model uses the capabilities of the Bayesian decision networks, which combines Bayesian networks with the utility theory, to assign weights to different characteristics and consolidate the related variables to output the result. The model evaluation in a realistic testbed environment with three servers, an established HAC and a well-known EA shows that the model can improve the area under the Receiver Operating Characteristic (ROC) curve for prediction of locally manageable failures by up to 9.05% compared to the baseline HAC results.


“…The results showed that the Random Forest algorithm achieved the best accuracy. Frank et al. [67] tried to identify failed nodes that are being used by running large-scale applications on the HPC system. The authors proposed a new feature-based system for node failure predictors using machine learning with a low percentage of false alarms at large scales.…”

Section: Related Work (mentioning)

confidence: 99%

“…References | Features | References
Skewness | [54], [55], [56], [58], [61], [62], [63], [64], [65], [66] | Count above mean | [60]
Kurtosis | [54], [55], [56], [58], [61], [62], [63], [64], [65], [66] | Count below mean | [60]
Mean | [56], [58], [59], [60], [62], [64], [66], [67], [68], | Historical change | [60]
Autocorrelation or Serial correlation | [54], [55], [59], [61], [62], [63], [65] | Simple moving average | [60]
Standard deviation | [55], [56], [58], [62], [63], [64], [67] | Weighted moving average | [60]
C3 (nonlinearity) | [54], [55], [61], …”

Section: Features (mentioning)

confidence: 99%

KPIs-Based Clustering and Visualization of HPC Jobs: A Feature Reduction Approach

2021


High-Performance Computing (HPC) systems need to be constantly monitored to ensure their stability. The monitoring systems collect a tremendous amount of data about different parameters or Key Performance Indicators (KPIs), such as resource usage, IO waiting time, etc. A proper analysis of this data, usually stored as time series, can provide insight in choosing the right management strategies as well as the early detection of issues. In this paper, we introduce a methodology to cluster HPC jobs according to their KPI indicators. Our approach reduces the inherent high dimensionality of the collected data by applying two techniques to the time series: literature-based and variance-based feature extraction. We also define a procedure to visualize the obtained clusters by combining the two previous approaches and the Principal Component Analysis (PCA). Finally, we have validated our contributions on a real data set to conclude that those KPIs related to CPU usage provide the best cohesion and separation for clustering analysis and the good results of our visualization methodology.

Index Terms: Clustering, feature extraction, high-performance computing, time series analysis.
