International Symposium on Parallel and Distributed Processing With Applications 2010
DOI: 10.1109/ispa.2010.84

Proficiency Metrics for Failure Prediction in High Performance Computing

Abstract: The number of failures occurring in large-scale high performance computing (HPC) systems is significantly increasing due to the large number of physical components found on the system. Fault tolerance (FT) mechanisms help parallel applications mitigate the impact of failures. However, using such mechanisms requires additional overhead. As such, failure prediction is needed in order to smartly utilize FT mechanisms. Hence, the proficiency of a failure prediction determines the efficiency of FT mechanism utiliza…

Cited by 5 publications (4 citation statements)
References 10 publications
“…Second, previous studies evaluate the proposed methods using classical prediction metrics, such as precision, recall and F1-score. Although these classical metrics are often suitable, various studies [12]-[15] and our results show that they are insufficient for evaluation of HPC failure predictors because they cannot be used to determine whether prediction is useful in practice.…”
Section: Introduction (mentioning)
confidence: 80%
“…Although these classical metrics are often suitable, various studies [12]-[15] show that they are insufficient for evaluation of HPC failure predictors. This is because, as shown in Section V-D, they are not correlated with a cost-benefit analysis, and therefore cannot be used to decide whether and for which model parameters the prediction is useful in practice.…”
Section: B. Precision, Recall and F1-score (mentioning)
confidence: 99%
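
For readers unfamiliar with the classical metrics named in the statements above, the following is a minimal sketch of their standard definitions, computed from confusion counts. The function name `classical_metrics` and the example counts are hypothetical and are not taken from the cited papers. The sketch does not attempt the cost-benefit analysis or the "lost computing time" metric, which the excerpts here do not define.

```python
def classical_metrics(tp: int, fp: int, fn: int) -> dict:
    """Return precision, recall and F1-score from confusion counts.

    tp: failures that were predicted and did occur (true positives)
    fp: predicted failures that did not occur (false positives)
    fn: failures that occurred without being predicted (false negatives)
    """
    # Guard against division by zero when a denominator is empty.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts, chosen only for illustration.
metrics = classical_metrics(tp=8, fp=2, fn=8)
print(metrics)
```

None of the three values depends on how much computation a missed failure destroys or how much overhead a false alarm triggers, which is consistent with the citing authors' point that these metrics alone may not indicate whether a predictor is useful in practice.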
“…In addition to the previous approaches for failure prediction and analysis, the literature includes other approaches that measure the quality of failure prediction. For example, in [12], the authors proposed a new metric for measuring the failure prediction error. Instead of using the mean square error, precision, or recall, they used a metric called "lost computing time."…”
Section: Introduction (mentioning)
confidence: 99%