2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS) 2016
DOI: 10.1109/ipdps.2016.111

Fault Modeling of Extreme Scale Applications Using Machine Learning

Abstract: Faults are commonplace in large-scale systems. These systems experience a variety of faults, such as transient, permanent, and intermittent faults. Multi-bit faults are typically not corrected by the hardware, resulting in an error. This paper attempts to answer an important question: given a multi-bit fault in main memory, will it result in an application error (and hence require that a recovery algorithm be invoked), or can it be safely ignored? We propose an application fault modeling methodology to answer this question. Given…

Cited by 20 publications (15 citation statements)
References 28 publications
“…Some works use collaborative filtering to colocate tasks in clouds by estimating application interference [30]. Others are closer to the application level and use binary classification to distinguish benign memory faults from application errors in order to execute recovery algorithms (see [31] for instance).…”
Section: Data-aware Resource Management (mentioning)
confidence: 99%
“…ML-based methods have been used in several ways. Vishnu et al. [18] proposed a combination of eight ML algorithms to assess the impact of multi-bit memory errors on HPC applications and to compare their predictions with the fault-injection results. In [19], the authors used linear-regression techniques to backtrack the propagation of soft errors through processes dependent on many-core processor systems.…”
Section: Related Work (mentioning)
confidence: 99%
“…While this approach focuses on reducing the total time to complete the fault injection campaigns, our approach aims at correlating large subsets of application profiles and architecture characteristics with fault injection results in order to pinpoint the most relevant parameters/traces on the target system. Vishnu et al. [12] evaluate the impact of multi-bit memory errors, both permanent and transient, on HPC applications. This work considers eight different ML algorithms (e.g., support vector machines, k-nearest neighbors, three distinct decision trees), comparing their predictions (i.e., the error probability) with the ground truth (i.e., fault injection results).…”
Section: Review Of Fault Injection Approaches Using Virtual Platf… (mentioning)
confidence: 99%