2015 IEEE International Conference on Cluster Computing 2015
DOI: 10.1109/cluster.2015.104

Fault-Tolerant Protocol for Hybrid Task-Parallel Message-Passing Applications

Cited by 14 publications (16 citation statements)
References 16 publications
“…In the area of High-Performance Computing (HPC), parallel systems continue increasing the number of components to improve their performance and, as a consequence, ensuring their reliability has become a critical issue. Nowadays, fault rates involve just a few hours on modern platforms [1] but it is forecasted that large parallel applications will have to manage fault rates of barely some minutes in exascale supercomputers [2]. In that sense, these applications require some help to progress efficiently.…”
Section: Introduction
mentioning
confidence: 99%
“…In an unrefined version, our method is an expensive approach, because it needs to keep an undetermined number of active checkpoints and may require several restart attempts. Instead, user-level checkpoints are becoming more usual, especially due to their lower costs and portability options [1].…”
mentioning
confidence: 99%
“…Attempts have been made to add resilience to PaRSEC [4] and OmpSs [22]. Other work focuses on soft faults [4], i.e., they take advantage of the algorithmic properties of ABFT methods to detect and recover from failures at a fine grain (task level) and utilize periodic…”
Section: Task-based Resilience
mentioning
confidence: 99%
“…checkpointing at a coarse grain (application). Yet others uses CR and message logging at the task granularity to tolerate faults with re-execution [22].…”
Section: Task-based Resilience
mentioning
confidence: 99%
“…The work of Subasi et al [38] is based on programmer knowledge in order to achieve effective partial replication. The NanoCheckpoints [36] and the message logging protocol proposed by Martsinkevich et al [27] address fail-stop errors of task-parallel computations. SSD [37] is designed by using machine learning techniques to mitigate silent errors in HPC applications.…”
Section: Related Work
mentioning
confidence: 99%