Proficiency Metrics for Failure Prediction in High Performance Computing

Taerat, Narate; Leangsuksun, Chokchai; Chandler, Clayton; Naksinehaboon, Nichamon

doi:10.1109/ispa.2010.84

Cited by 5 publications

(4 citation statements)

References 10 publications


“…Second, previous studies evaluate the proposed methods using classical prediction metrics, such as precision, recall and F1-score. Although these classical metrics are often suitable, various studies [12]-[15] and our results show that they are insufficient for evaluation of HPC failure predictors because they cannot be used to determine whether prediction is useful in practice.…”

Section: Introduction (mentioning)

confidence: 80%

“…Although these classical metrics are often suitable, various studies [12]-[15] show that they are insufficient for evaluation of HPC failure predictors. This is because, as shown in Section V-D, they are not correlated with a cost-benefit analysis, and therefore cannot be used to decide whether and for which model parameters the prediction is useful in practice.…”

Section: B Precision Recall and F1-score (mentioning)

confidence: 99%

“…The objective of the HPC failure prediction mechanism is to increase the effective use of the HPC system by reducing the compute time lost due to failures [12], [13]. We therefore perform a cost-benefit analysis that compares the system resources needed for training, failure prediction and failure mitigation against the saved compute time due to successful failure prediction and mitigation.…”

Section: Cost-benefit Calculation (mentioning)

confidence: 99%


Cost-Aware Prediction of Uncorrected DRAM Errors in the Field

Boixaderas, Zivanovic, Moré, et al. 2020

SC20: International Conference for High Performance Computing, Networking, Storage and Analysis


This paper presents and evaluates a method to predict DRAM uncorrected errors, a leading cause of hardware failures in large-scale HPC clusters. The method uses a random forest classifier, which was trained and evaluated using error logs from two years of production of the MareNostrum 3 supercomputer. By enabling the system to take measures to mitigate node failures, our method reduces lost compute time by up to 57%, a net saving of 21,000 node-hours per year. We release all source code as open source. We also discuss and clarify aspects of methodology that are essential for a DRAM prediction method to be useful in practice. We explain why standard evaluation metrics, such as precision and recall, are insufficient, and base the evaluation on a cost-benefit analysis. This methodology can help ensure that any DRAM error predictor is clear from training bias and has a clear cost-benefit calculation.
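To illustrate why precision, recall and F1-score alone may not settle whether a predictor is useful, the following minimal Python sketch computes them next to a simplified net saving in node-hours. It is not the method of either paper: the names `PredictionOutcome`, `net_saved_node_hours`, `saved_per_hit` and `cost_per_alarm`, the confusion counts, and the per-event costs are all hypothetical, and training cost is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class PredictionOutcome:
    """Hypothetical confusion counts for a failure predictor."""
    true_positives: int   # failures predicted in time
    false_positives: int  # alarms with no failure following
    false_negatives: int  # failures that were missed


def precision(o: PredictionOutcome) -> float:
    alarms = o.true_positives + o.false_positives
    return o.true_positives / alarms if alarms else 0.0


def recall(o: PredictionOutcome) -> float:
    failures = o.true_positives + o.false_negatives
    return o.true_positives / failures if failures else 0.0


def f1_score(o: PredictionOutcome) -> float:
    p, r = precision(o), recall(o)
    return 2 * p * r / (p + r) if (p + r) else 0.0


def net_saved_node_hours(o: PredictionOutcome,
                         saved_per_hit: float,
                         cost_per_alarm: float) -> float:
    """Simplified cost-benefit: compute time saved by correct predictions
    minus the mitigation cost paid on every alarm, true or false.
    Both per-event values are assumptions supplied by the caller."""
    alarms = o.true_positives + o.false_positives
    return o.true_positives * saved_per_hit - alarms * cost_per_alarm


# Made-up counts, chosen only for illustration.
outcome = PredictionOutcome(true_positives=40, false_positives=60,
                            false_negatives=10)

print("precision:", precision(outcome))
print("recall:   ", recall(outcome))
print("F1-score: ", f1_score(outcome))

# The classical metrics are fixed by the counts, but the net saving
# depends on what a mitigation action costs.
for cost in (2.0, 5.0):
    saving = net_saved_node_hours(outcome, saved_per_hit=10.0,
                                  cost_per_alarm=cost)
    print(f"cost per alarm {cost}: net saving {saving} node-hours")
```

Under these assumed numbers the same predictor, with an unchanged F1-score, moves from a net saving to a net loss once the cost of acting on an alarm rises, which is the kind of dependence a cost-benefit analysis captures and the classical metrics do not.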


“…In addition to the previous approaches for failure prediction and analysis, the literature includes other approaches that measure the quality of failure prediction. For example, in [12], the authors proposed a new metric for measuring the failure prediction error. Instead of using the mean square error, precision, or recall, they used a metric called "lost computing time."…”

Section: Introduction (mentioning)

confidence: 99%

Scalable Approach to Failure Analysis of High-Performance Computing Systems

Shawky 2014

ETRI J


Failure analysis is necessary to clarify the root cause of a failure, predict the next time a failure may occur, and improve the performance and reliability of a system. However, it is not an easy task to analyze and interpret failure data, especially for complex systems. Usually, these data are represented using many attributes, and sometimes they are inconsistent and ambiguous. In this paper, we present a scalable approach for the analysis and interpretation of failure data of high-performance computing systems. The approach employs rough sets theory (RST) for this task. The application of RST to a large publicly available set of failure data highlights the main attributes responsible for the root cause of a failure. In addition, it is used to analyze other failure characteristics, such as time between failures, repair times, workload running on a failed node, and failure category. Experimental results show the scalability of the presented approach and its ability to reveal dependencies among different failure characteristics.
