Blocking vs. Non-Blocking Coordinated Checkpointing for Large-Scale Fault Tolerant MPI

Coti; Herault; Lemarinier, Pierre; Pilard; Rezmerita; Rodriguez; Cappello

doi:10.1109/sc.2006.15

Cited by 40 publications

(24 citation statements)

References 18 publications

“…In the case of application-level CR, a synchronization point is typically used right before checkpointing in order to guarantee that all messages have been consumed. A similar technique is also widely leveraged in practice for system-level checkpointing that uses a coordinated protocol [14], both for the blocking and non-blocking case. More recently, uncoordinated checkpointing protocols, which previously received little attention in practice due to the cost and complexity introduced by message logging [6], have been increasingly considered for certain classes of HPC applications [23].…”

Section: How To Capture the State Of The Application (mentioning)

confidence: 99%

BlobCR: Virtual disk based checkpoint-restart for HPC applications on IaaS clouds

Nicolae

Cappello

2013

Journal of Parallel and Distributed Computing

Self Cite

Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running HPC applications. Given the need to provide fault tolerance, support for suspend-resume and offline migration, an efficient Checkpoint-Restart mechanism becomes paramount in this context. We propose BlobCR, a dedicated checkpoint repository that is able to take live incremental snapshots of the whole disk attached to the virtual machine (VM) instances. BlobCR aims to minimize the performance overhead of checkpointing by persisting VM disk snapshots asynchronously in the background using a low-overhead technique we call selective copy-on-write. It includes support for both application-level and process-level checkpointing, as well as support to roll back file system changes. Experiments at large scale demonstrate the benefits of our proposal both in synthetic settings and for a real-life HPC application.

“…This algorithm uses markers to coordinate the backup, and operates under the assumption of FIFO channels. In [3], blocking and non-blocking coordinated checkpointing protocols have been compared. Experiments have shown that the synchronization between nodes induced by the blocking protocol penalizes the performance of the computation more than a non-blocking protocol does.…”

Section: Non-blocking Coordinated Checkpointing (mentioning)

confidence: 99%

Performance comparison of hierarchical checkpoint protocols grid computing

Ndiaye

Sens

Thiaré

2012

IJIMAI

Grid infrastructure is a large set of geographically distributed nodes connected by a communication network. In this context, fault tolerance is a necessity imposed by distribution, which raises a number of problems related to the heterogeneity of hardware, operating systems, networks, middleware, and applications, the dynamic nature of resources, scalability, the lack of common memory, the lack of a common clock, and asynchronous communication between processes. To improve the robustness of supercomputing applications in the presence of failures, many techniques have been developed to make systems resistant to these faults. Fault tolerance is intended to allow the system to provide service as specified despite the occurrence of faults, and it is an indispensable element of distributed systems. To meet this need, several techniques have been proposed in the literature. We study the protocols based on rollback recovery. These protocols fall into two categories: coordinated checkpointing and rollback protocols, and log-based independent checkpointing protocols, or message logging protocols. However, the performance of a protocol depends on the characteristics of the system, the network, and the applications running. Faced with the constraints of large-scale environments, many algorithms from the literature have proven inadequate. Given an application environment and a system, it is not easy to identify the recovery protocol that is most appropriate for a cluster or a hierarchical environment such as grid computing. While some protocols have been used successfully at small scale, they are not suitable for use at large scale. Hence there is a need to implement these protocols in a hierarchical fashion and compare their performance in grid computing. In this paper, we propose hierarchical versions of four well-known protocols. We have implemented these protocols and compared their performance in clusters and grid computing using the Omnet++ simulator.

“…However, most research on checkpointing for MPI programs focuses on coordinated checkpointing. [12] compared blocking with non-blocking coordinated checkpointing for large-scale fault tolerant MPI programs. The authors found that for high-speed networks, the blocking implementation gives the best performance for sensible checkpoint frequency.…”

Section: Related Work (mentioning)

confidence: 99%

WBC-ALC: A Weak Blocking Coordinated Application-Level Checkpointing for MPI Programs

Yang

Lin

2012

IEICE Trans. Inf. & Syst.

SUMMARY: As supercomputers increase in size, the mean time between failures (MTBF) of a system becomes shorter, and the reliability problem of supercomputers becomes more and more serious. MPI is currently the de facto standard used to build high-performance applications, and research on fault tolerance methods for MPI remains an active topic. However, due to the characteristics of MPI programs, most current checkpointing methods for MPI programs need to modify the MPI library (or even the operating system), or implement a complicated protocol by logging many messages. In this paper, we carry forward the idea of Application-Level Checkpointing (ALC). Based on the general fact that programmers are familiar with the communication characteristics of applications, we have developed BC-ALC, a new portable blocking coordinated ALC for MPI programs. BC-ALC neither modifies the MPI library (or the operating system) nor logs any message. It implements coordination using only Barrier operations instead of any complicated protocol. Furthermore, in order to reduce the cost of fault tolerance, we reduce the synchronization range of the barrier, and design WBC-ALC, a weak blocking coordinated ALC that uses group synchronization instead of global synchronization, based on the communication relationship between processes. We also propose a fault-tolerance framework developed on top of WBC-ALC and discuss an implementation of it. Experimental results on NPB3.3-MPI benchmarks validate BC-ALC and WBC-ALC, and show that compared with BC-ALC, the average coordination time and the average backup time of a single checkpoint in WBC-ALC are reduced by 44.5% and 5.7% respectively.
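
The barrier-based coordination described in this summary can be illustrated with a minimal sketch. It is not the BC-ALC implementation: Python threads stand in for MPI processes, no messages are exchanged, and the names `run_ranks`, `rank_main`, and `checkpoint_every` are hypothetical. It only shows the blocking structure, in which every rank waits at a barrier before saving its state and again before resuming.

```python
import threading


def run_ranks(num_ranks, steps, checkpoint_every):
    """Simulate ranks that checkpoint only after meeting at a barrier.

    Assumption: threads stand in for MPI processes; a real program
    would use an MPI barrier and write checkpoints to stable storage.
    """
    barrier = threading.Barrier(num_ranks)
    checkpoints = [[] for _ in range(num_ranks)]  # saved states per rank

    def rank_main(rank):
        state = 0
        for step in range(1, steps + 1):
            state += rank + 1  # stand-in for local computation
            if step % checkpoint_every == 0:
                barrier.wait()  # all ranks reach the same step first
                checkpoints[rank].append((step, state))
                barrier.wait()  # no rank resumes until all have saved

    threads = [
        threading.Thread(target=rank_main, args=(rank,))
        for rank in range(num_ranks)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return checkpoints


if __name__ == "__main__":
    for rank, saved in enumerate(run_ranks(4, 6, 3)):
        print(rank, saved)
```

Because every rank blocks at the same step, the saved states form a consistent global snapshot by construction. The cost is that the slowest rank sets the pace for all others, which is the overhead that group synchronization, as described for WBC-ALC, aims to reduce.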
