NanoCheckpoints: A Task-Based Asynchronous Dataflow Framework for Efficient and Scalable Checkpoint/Restart

Arias, J.R.; Unsal, Osman S.; Labarta, Jesús; Cristal, Adrián

doi:10.1109/pdp.2015.17

Cited by 26 publications

(20 citation statements)

References 9 publications

“…The process which runs a task is generally capable of detecting task faults. There are several examples of shared memory runtimes which are capable of detecting and correcting task faults within parallel regions [16,26].…”

Section: Task Faults (mentioning)

confidence: 99%

A taxonomy of task-based parallel programming technologies for high-performance computing

et al. 2018

Task-based programming models for shared memory, such as Cilk Plus and OpenMP 3, are well established and documented. However, with the increase in parallel, many-core, and heterogeneous systems, a number of research-driven projects have developed more diversified task-based support, employing various programming and runtime features. Unfortunately, despite the fact that dozens of different task-based systems exist today and are actively used for parallel and high-performance computing (HPC), no comprehensive overview or classification of task-based technologies for HPC exists. In this paper, we provide an initial task-focused taxonomy for HPC technologies, which covers both programming interfaces and runtime mechanisms. We

“…Prior to the incorporation of the OmpSs offload functionality, "smart" C/R was introduced in the Nanos++ runtime to provide efficient fault-tolerance capabilities by benefiting from the PM semantics (leveraging the task data dependencies) [19]. This protected from memory faults reported by the OS.…”

Section: Resilience in Task-based PMs (mentioning)

confidence: 99%

Supporting automatic recovery in offloaded distributed programming models through MPI-3 techniques

Peña, Beltrán, Clauss, et al. 2017

Proceedings of the International Conference on Supercomputing

In this paper we describe the design of fault tolerance capabilities for general-purpose offload semantics, based on the OmpSs programming model. Using ParaStation MPI, a production MPI-3.1 implementation, we explore the features that, being standard compliant, an MPI stack must support to provide the necessary fault tolerance guarantees, based on MPI's dynamic process management. Our results, including synthetic benchmarks and applications, reveal low runtime overhead and efficient recovery, demonstrating that the existing MPI standard provided us with sufficient mechanisms to implement an effective and efficient fault-tolerant solution.
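The citation statement above describes checkpoint/restart that uses the data dependencies tasks declare. A minimal sketch of that general idea follows: back up a task's declared inputs before it runs, and on a fault restore them and re-execute. It is not the Nanos++ or NanoCheckpoints implementation. The names `TaskFault`, `run_with_checkpoint`, and `make_flaky_scale` are hypothetical, and the fault is injected by the example itself rather than reported by the OS.

```python
import copy


class TaskFault(Exception):
    """Hypothetical stand-in for a fault detected while a task runs."""


def run_with_checkpoint(task, data, inputs, max_retries=3):
    """Run task(data); on TaskFault restore the declared inputs and retry.

    `inputs` lists the keys of `data` the task reads (including any it
    updates in place). Only those entries are backed up, which is the point
    of using declared dependencies instead of a full-memory checkpoint.
    Returns the number of re-executions that were needed.
    """
    backup = {name: copy.deepcopy(data[name]) for name in inputs}
    for attempt in range(max_retries + 1):
        try:
            task(data)
            return attempt
        except TaskFault:
            # Undo any partial in-place updates before re-executing.
            for name in inputs:
                data[name] = copy.deepcopy(backup[name])
    raise RuntimeError("task still failing after %d retries" % max_retries)


def make_flaky_scale(fail_times):
    """Build a task that doubles data['x'] in place and faults
    `fail_times` times after having modified the first element."""
    remaining = {"count": fail_times}

    def scale(data):
        xs = data["x"]
        for i in range(len(xs)):
            xs[i] *= 2
            if i == 0 and remaining["count"] > 0:
                remaining["count"] -= 1
                raise TaskFault("injected fault")

    return scale


data = {"x": [1, 2, 3]}
retries = run_with_checkpoint(make_flaky_scale(1), data, inputs=["x"])
print(retries, data["x"])  # prints: 1 [2, 4, 6]
```

Restoring only the declared inputs is enough in this sketch because anything the task writes without reading is overwritten on re-execution. A real runtime would also have to decide where backups live and how faults are detected, and neither is specified in the text quoted here.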

“…The work of Subasi et al [38] is based on programmer knowledge in order to achieve effective partial replication. The NanoCheckpoints [36] and the message logging protocol proposed by Martsinkevich et al [27] address fail-stop errors of task-parallel computations. SSD [37] is designed by using machine learning techniques to mitigate silent errors in HPC applications.…”

Section: Related Work (mentioning)

confidence: 99%

A Runtime Heuristic to Selectively Replicate Tasks for Application-Specific Reliability Targets

Subaşi, Yalcin, Zyulkyarov, et al. 2016

2016 IEEE International Conference on Cluster Computing (CLUSTER)

Self Cite

In this paper we propose a runtime-based selective task replication technique for task-parallel high performance computing applications. Our selective task replication technique is automatic and does not require modification or recompilation of the OS, compiler, or application code. Our heuristic, which we call App FIT, selects tasks to replicate such that the specified reliability target for an application is achieved. In our experimental evaluation, we show that the App FIT selective replication heuristic is low-overhead and highly scalable. In addition, results indicate that complete task replication is overkill for achieving reliability targets. We show that with App FIT, we can tolerate pessimistic exascale error rates with only 53% of the tasks being replicated.
