2018
DOI: 10.14569/ijacsa.2018.091207

A Two-Level Fault-Tolerance Technique for High Performance Computing Applications

Abstract: Reliability is the biggest concern facing future extreme-scale, high performance computing (HPC) systems. Projections from the current generation of HPC systems suggest that errors will occur at very high rates in future systems. Thus, it is fundamental that we detect errors that can cause the failure of important applications, such as scientific ones. In this paper, we present a two-level fault-tolerance approach for the detection and classification of errors for Compute Unified Device Architecture…

Citations: cited by 1 publication (1 citation statement)
References: 33 publications
“…A state-of-the-art failure detection, prediction, and recovery techniques in exascale systems has introduced [15]. In research [16], [17], a fault tolerant framework is designed and implemented for heterogeneous applications to increase scalability. A various gossip-based algorithms were implemented for failure detection and consensus [18]-[20].…”
Section: Introduction
Citation type: mentioning
confidence: 99%