VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes

Leblanc, Troy; Anand, Rakhi; Gabriel, Edgar; Subhlok, Jaspal

doi:10.1007/978-3-642-03770-2_19

Cited by 39 publications

(36 citation statements)

References 10 publications


“…In contrast, this work (i) considers dependent tasks such as found in applications consisting of linear workflows; and (ii) proposes an optimal dynamic programming algorithm to solve the selective replication and checkpointing problem. Combining replication with checkpointing has also been proposed in [29,41,16] for HPC platforms, and in [22,37] for grid computing.…”

Section: Replication (mentioning)

confidence: 99%

Combining Checkpointing and Replication for Reliable Execution of Linear Workflows

Benoît, Cavelan, Ciorba et al. 2018

2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)


This report combines checkpointing and replication for the reliable execution of linear workflows. While both methods have been studied separately, their combination has not yet been investigated despite its promising potential to minimize the execution time of linear workflows in failure-prone environments. The combination raises new problems: for each task, we have to decide whether to checkpoint and/or replicate it. We provide an optimal dynamic programming algorithm of quadratic complexity to solve both problems. This dynamic programming algorithm has been validated through extensive simulations that reveal the conditions in which checkpointing only, replication only, or the combination of both techniques lead to improved performance.


“…Recent advances include multi-level approaches, or the use of SSD or NVRAM as secondary storage [14]. Combining replication with checkpointing has been proposed in [41,49,25] for HPC platforms, and in [33,46] for grid computing.…”

Section: Replication For Fail-stop Errors (mentioning)

confidence: 99%

Coping with silent and fail-stop errors at scale by combining replication and checkpointing

Benoît, Cavelan, Cappello et al. 2018

Journal of Parallel and Distributed Computing


“…The idea of node duplication via PMPI has been used in the fault tolerance community, particularly rMPI [7], MR-MPI [6] and VolPEX [11]. Here, duplication ensures that if any particular node goes down, its duplicate will step in to allow execution to continue without interruption.…”

Section: Related Work (mentioning)

confidence: 99%

Parallelizing heavyweight debugging tools with mpiecho

et al. 2013


Idioms created for debugging execution on single processors and multicore systems have been successfully scaled to thousands of processors, but there is little hope that this class of techniques can continue to be scaled out to tens of millions of cores. In order to allow development of more scalable debugging idioms we introduce MPIecho, a novel runtime platform that enables cloning of MPI ranks. Given identical execution on each clone, we then show how heavyweight debugging approaches can be parallelized, reducing their overhead to a fraction of the serialized case. We also show how this platform can be useful in isolating the source of hardware-based nondeterministic behavior and provide a case study based on a recent processor bug at LLNL. While total overhead will depend on the individual tool, we show that the platform itself contributes little: 512x tool parallelization incurs at worst 2x overhead across the NAS Parallel benchmarks, and hardware fault isolation contributes at worst an additional 44% overhead. Finally, we show how MPIecho can lead to near-linear reduction in overhead when combined with Maid, a heavyweight memory tracking tool provided with Intel's Pin platform. We demonstrate overhead reduction from 1,466% to 53% and from 740% to 14% for cg.D.64 and lu.D.64, respectively, using only an additional 64 cores.
