2019 IEEE International Conference on Cluster Computing (CLUSTER) 2019
DOI: 10.1109/cluster.2019.8891034

Algorithm-Based Fault Tolerance for Parallel Stencil Computations

Abstract: The increase in HPC systems size and complexity, together with increasing on-chip transistor density, power limitations, and number of components, render modern HPC systems subject to soft errors. Silent data corruptions (SDCs) are typically caused by such soft errors in the form of bit-flips in the memory subsystem and hinder the correctness of scientific applications. This work addresses the problem of protecting a class of iterative computational kernels, called stencils, against SDCs when executing on para…
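As a rough illustration of the checksum idea behind algorithm-based fault tolerance for stencils, the sketch below checks one sweep of a 1D three-point stencil with periodic boundaries. It is not taken from the paper: the function names, the weights, the tolerance, and the simulated corruption are all assumptions made for illustration. The invariant it relies on is that, with periodic boundaries, every old grid value contributes once with each weight, so the sum of the new grid equals (a + b + c) times the sum of the old grid, up to rounding.

```python
import math


def stencil_step(u, a, b, c):
    """One sweep of a 3-point stencil with periodic boundaries (hypothetical helper)."""
    n = len(u)
    return [a * u[(i - 1) % n] + b * u[i] + c * u[(i + 1) % n] for i in range(n)]


def checksum_ok(u_old, u_new, a, b, c, rel_tol=1e-9):
    """Compare the observed checksum of u_new with the one predicted from u_old.

    Assumption: periodic boundaries, so sum(u_new) == (a + b + c) * sum(u_old)
    up to floating-point rounding. The tolerance is an illustrative choice.
    """
    predicted = (a + b + c) * math.fsum(u_old)
    observed = math.fsum(u_new)
    return math.isclose(predicted, observed, rel_tol=rel_tol, abs_tol=1e-12)


# Small example grid and weights (illustrative values only).
u = [float(i % 7) for i in range(64)]
a, b, c = 0.25, 0.5, 0.25

clean = stencil_step(u, a, b, c)
clean_ok = checksum_ok(u, clean, a, b, c)

corrupted = list(clean)
corrupted[10] += 1.0  # stand-in for a silent corruption of one grid point
corrupted_ok = checksum_ok(u, corrupted, a, b, c)

print("clean sweep passes check:", clean_ok)
print("corrupted sweep passes check:", corrupted_ok)
```

A single global sum like this can only signal that something went wrong during the sweep. Locating and correcting the faulty point would need additional information, for example per-row and per-column checksums on a 2D grid, which this sketch does not attempt.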

Cited by 5 publications (1 citation statement)
References 39 publications
“…ABFT concepts have been extended to process failures for a wide range of matrix operations both for detection and mitigation purposes (Bosilca et al, 2009; Chen and Dongarra, 2008; Du et al, 2012; Jia et al, 2013; Kim et al, 1996) and general communication patterns (Kabir and Goswami, 2016). ABFT has also recently been proposed for parallel stencil-based operations to accurately detect and correct silent data corruptions (Cavelan and Ciorba, 2019). In these scenarios the general strategy is a combination of checkpointing and replication of checksums.…”
Section: Numerical Algorithms For Resilience (mentioning)
confidence: 99%