2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing 2015
DOI: 10.1109/pdp.2015.17

NanoCheckpoints: A Task-Based Asynchronous Dataflow Framework for Efficient and Scalable Checkpoint/Restart

Abstract: In this paper, we present NanoCheckpoints, a lightweight software-based checkpoint/restart scheme for task-parallel HPC applications. We leverage OmpSs, a task-based OpenMP derivative programming model (PM), and its Nanos asynchronous dataflow runtime. NanoCheckpoints achieves minimal overheads by checkpointing only tasks' inputs, which are available for free in the OmpSs PM. We evaluate NanoCheckpoints with both pure task-parallel shared-memory benchmarks (up to 16 cores) and hybrid OmpSs+MPI applications …
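
The idea of checkpointing only a task's inputs and re-executing the task after a fault can be illustrated with a minimal Python sketch. This is not the Nanos runtime's API or the authors' implementation: `TaskFault`, `run_with_input_checkpoint`, and `make_flaky_double` are hypothetical names introduced here, and the fault is injected by hand rather than reported by the OS.

```python
import copy


class TaskFault(Exception):
    """Hypothetical stand-in for a fault detected while a task runs."""


def run_with_input_checkpoint(task, inputs, max_retries=3):
    """Run task(inputs); on a TaskFault, restore the inputs and re-execute."""
    # Checkpoint: copy only the task's inputs before the task starts.
    checkpoint = copy.deepcopy(inputs)
    for _ in range(max_retries + 1):
        try:
            return task(inputs)
        except TaskFault:
            # Restart: discard the possibly corrupted inputs and restore
            # a fresh copy from the checkpoint before trying again.
            inputs = copy.deepcopy(checkpoint)
    raise RuntimeError("task still failing after all retries")


def make_flaky_double():
    """Build a task that doubles a list in place and faults on its first run."""
    state = {"calls": 0}

    def double_in_place(values):
        state["calls"] += 1
        for i in range(len(values)):
            values[i] *= 2
            if state["calls"] == 1 and i == 0:
                # Injected fault after a partial in-place update.
                raise TaskFault("injected fault")
        return values

    return double_in_place


result = run_with_input_checkpoint(make_flaky_double(), [1, 2, 3])
print(result)
```

Because the first attempt has already modified the list when the fault is raised, re-running the task without restoring its inputs would double the first element twice. Restoring from the input checkpoint is what makes the re-execution safe in this sketch.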

Citations: cited by 26 publications (20 citation statements)
References: 9 publications
“…The process which runs a task is generally capable of detecting task faults. There are several examples of shared memory runtimes which are capable of detecting and correcting task faults within parallel regions [16,26].…”
Section: Task Faults
Type: mentioning
confidence: 99%
“…Prior to the incorporation of the OmpSs offload functionality, "smart" C/R was introduced in the Nanos++ runtime to provide efficient fault-tolerance capabilities by benefiting from the PM semantics (leveraging the task data dependencies) [19]. This protected from memory faults reported by the OS.…”
Section: Resilience In Task-based PMs
Type: mentioning
confidence: 99%
“…The work of Subasi et al [38] is based on programmer knowledge in order to achieve effective partial replication. The NanoCheckpoints [36] and the message logging protocol proposed by Martsinkevich et al [27] address fail-stop errors of task-parallel computations. SSD [37] is designed by using machine learning techniques to mitigate silent errors in HPC applications.…”
Section: Related Work
Type: mentioning
confidence: 99%