2003
DOI: 10.1145/966049.781513
Automated application-level checkpointing of MPI programs

Abstract: Because of increasing hardware and software complexity, the running time of many computational science applications now exceeds the mean time to failure of high-performance computing platforms. Therefore, computational science applications need to tolerate hardware failures. In this paper, we focus on the stopping failure model, in which a faulty process hangs and stops responding to the rest of the system. We argue that tolerating such faults is best done by an approach called application-level coordinated n…
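The approach named in the abstract, application-level checkpointing, means the program itself saves exactly the state it needs at safe points and restores it on restart, rather than relying on the system to dump the whole process image. A minimal single-process sketch (file name, state layout, and function are illustrative assumptions, not from the paper):

```python
import json
import os

CKPT = "checkpoint.json"  # hypothetical checkpoint file name


def run(total_steps=10):
    # On restart, resume from the last application-level checkpoint if one exists.
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "acc": 0}

    while state["step"] < total_steps:
        state["acc"] += state["step"]  # stand-in for the real computation
        state["step"] += 1
        # Application-level checkpoint: serialize only the named program
        # state (not the whole process image) at a known-safe point.
        with open(CKPT, "w") as f:
            json.dump(state, f)
    return state["acc"]
```

If the process dies mid-run, calling `run` again picks up from the last saved step instead of recomputing from scratch; in an MPI program the same pattern must additionally be coordinated across ranks, which is the problem the paper addresses.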

Cited by 60 publications (57 citation statements)
References 16 publications
“…Our approach is motivated by the observation that on today's ever-larger systems, checkpointing has become a standard fault tolerance practice, in answer to the shortening MTBF (mean time between failures). With portable, optimized tools such as BLCR [17], parallel file systems designed with checkpointing as a major workload or even specifically for the checkpointing purpose (such as PLFS [18]), emerging high-performance hardware such as aggregated SSD storage, and a large amount of recent/ongoing research on checkpointing [19][20][21], the efficiency and scalability of job checkpointing has been improving significantly. Such growing checkpoint capability on large-scale systems enables us to relax the backfill conditions, by allowing jobs to be aggressively backfilled, and suspended for later execution if resources are due for advance reservation.…”
Section: CDF (%)
confidence: 99%
“…Such schemes are used by CoCheck [23], Starfish [1], Clip [10] and AMPI [15,26] to provide fault tolerant versions of MPI. A coordinated checkpointing algorithm that uses application level checkpointing is presented in [8]. Communication induced checkpoint protocols try to combine the advantages of coordinated and uncoordinated by allowing processors to take a mix of independent and coordinated checkpoints [7].…”
Section: Related Work
confidence: 99%
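The coordinated checkpointing that the statement above contrasts with uncoordinated schemes can be sketched as a toy, single-machine analogue: Python threads stand in for MPI processes, and a barrier plays the role of the coordination protocol, so that no message is in flight when each "process" saves its state and the saved states form a consistent global snapshot. The function name and state layout are illustrative assumptions, not from any cited paper:

```python
import threading


def coordinated_checkpoint(num_procs=4, steps=3):
    # All "processes" synchronize at a barrier before writing their
    # checkpoints, so the per-rank saved states line up as one
    # consistent global snapshot per step.
    barrier = threading.Barrier(num_procs)
    checkpoints = [[] for _ in range(num_procs)]

    def proc(rank):
        state = 0
        for step in range(steps):
            state += rank + step             # local computation
            barrier.wait()                   # coordinate: reach the checkpoint line
            checkpoints[rank].append(state)  # then checkpoint locally
            barrier.wait()                   # resume only after all have saved

    threads = [threading.Thread(target=proc, args=(r,)) for r in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return checkpoints
```

Uncoordinated schemes drop the barriers and let each process checkpoint independently, which avoids the synchronization cost but risks the domino effect on recovery; communication-induced protocols, as the quote notes, mix the two.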
“…System-level checkpoints at remote storage cause large amounts of data to be sent through the network, but application-level checkpoints require modifications of the application code, and as such are not completely transparent to the programmer, in the sense that a code written for a non-fault-tolerant implementation of MPI requires some modifications to be executed on a fault-tolerant implementation of MPI using application-level checkpoints [Schulz et al 2004] [Bronevetsky et al 2003]. …”
Section: Related Work
confidence: 99%