Proceedings of the 2018 ACM/SPEC International Conference on Performance Engineering 2018
DOI: 10.1145/3184407.3184421

Pattern-based Modeling of Multiresilience Solutions for High-Performance Computing

Abstract: Resiliency is the ability of large-scale high-performance computing (HPC) applications to gracefully handle errors and recover from failures. In this paper, we propose a pattern-based approach to constructing resilience solutions that handle multiple error modes. Using resilience patterns, we evaluate the performance and reliability characteristics of detection, containment and mitigation techniques for transient errors that cause silent data corruptions and techniques for fail-stop errors that result in proc…

Citations: cited by 7 publications (1 citation statement)
References: 12 publications
“…Reusable programming templates of these patterns can offer resilience portability across different HPC system architectures and permit design space exploration and adaptation to different (performance, resilience and power consumption) design trade-offs. An early prototype (Ashraf et al, 2018) offers multi-resilience for detection, containment and mitigation of SDC and MPI process failures.…”
Section: System Infrastructure Techniques For Resilience (mentioning)
confidence: 99%