2016
DOI: 10.1177/1094342016664796

Exploring versioned distributed arrays for resilience in scientific applications

Abstract: Exascale studies project reliability challenges for future HPC systems. We present the Global View Resilience (GVR) system, a library for portable resilience. GVR begins with a subset of the Global Arrays interface, and adds new capabilities to create versions, name versions, and compute on version data. Applications can focus versioning where and when it is most productive, and customize for each application structure independently. This control is portable, and its embedding in application source makes it na…

Cited by 11 publications (5 citation statements)
References 65 publications
“…Complementary to approaches that focus on resiliency of computational blocks, the Global View Resilience (GVR) project [47] concentrates on application data and guarantees resilience through multiple snapshot versions of the data whose creation is controlled by the programmer through application annotations. Bridges et al [36] proposed a malloc_failable that uses a callback mechanism to handle memory failures on dynamically allocated memory, so that the application programmer can specify recovery actions.…”
Section: Programming Model Techniques
Citation type: mentioning
confidence: 99%
“…System-level solutions, such as Distributed MultiThreaded CheckPointing (DMTCP) [18], support transparent state saving and restoration using OS support. GVR [47] is a runtime system that provides fault tolerance to applications by versioning distributed arrays for checkpoint recovery, while the checkpoint-on-failure protocol [18] for MPI applications leverages the features of a high-quality fault-tolerant MPI implementation. In either case, algorithm-specific knowledge is needed to perform checkpoint recovery. Some ABFT solutions [122] can utilize the original or previously saved data as a replacement for lost or erroneous data and recover their state to the point at which the error/failure event occurred.…”
Section: Start
Citation type: mentioning
confidence: 99%
“…Global View Resilience (GVR) (Chien et al, 2017) accommodates APIs to enable multiple versioning of global arrays for the single program, multiple data programming model. The core idea is the fact that naive data redundancy approaches potentially store wrong applications states due to the large latency associated with error detection and notification.…”
Section: System Infrastructure Techniques For Resilience
Citation type: mentioning
confidence: 99%
“…In the context of HPC systems, software solutions typically implement roll-forward recovery using algorithm-specific knowledge. For example, Global View of Resilience (GVR) [13] uses versioning of distributed arrays, in which roll-forward recovery is based on application-specified mechanisms for each array structure.…”
Section: Roll-forward Pattern
Citation type: mentioning
confidence: 99%