Application-level checkpointing for shared memory programs

Greg Bronevetsky, Daniel Marques, Keshav Pingali, Peter Szwed, Martin Schulz

doi:10.1145/1037949.1024421

Cited by 20 publications

(29 citation statements)

References 25 publications

“…Bronevetsky et al. provide a source-to-source compiler tool that can automatically instrument the code to save and restore its own state. The tool coordinates checkpoints and restarts for parallel OpenMP [18], [19] and MPI programs [20]-[22].…”

Section: Related Work (mentioning)

confidence: 99%

Deduplication Potential of HPC Applications’ Checkpoints

Kaiser, Gad, Süß et al. 2016

2016 IEEE International Conference on Cluster Computing (CLUSTER)

“…Blocking coordinated checkpointing with global barrier has been well used in OpenMP programs [20]. However, this checkpointing method has not been used in MPI programs.…”

Section: Related Work (mentioning)

confidence: 99%

WBC-ALC: A Weak Blocking Coordinated Application-Level Checkpointing for MPI Programs

Yang, Lin 2012

IEICE Trans. Inf. & Syst.

SUMMARY: As supercomputers increase in size, the mean time between failures (MTBF) of a system becomes shorter, and the reliability problem of supercomputers becomes increasingly serious. MPI is currently the de facto standard for building high-performance applications, and fault tolerance for MPI remains an active research topic. However, due to the characteristics of MPI programs, most current checkpointing methods for MPI programs need to modify the MPI library (or even the operating system), or implement a complicated protocol that logs many messages. In this paper, we carry forward the idea of Application-Level Checkpointing (ALC). Based on the general fact that programmers are familiar with the communication characteristics of their applications, we have developed BC-ALC, a new portable blocking coordinated ALC for MPI programs. BC-ALC neither modifies the MPI library or the operating system nor logs any messages. It implements coordination using only Barrier operations instead of a complicated protocol. Furthermore, to reduce the cost of fault tolerance, we reduce the synchronization range of the barrier and design WBC-ALC, a weak blocking coordinated ALC that uses group synchronization instead of global synchronization, based on the communication relationship between processes. We also propose a fault-tolerance framework built on top of WBC-ALC and discuss an implementation of it. Experimental results on the NPB3.3-MPI benchmarks validate BC-ALC and WBC-ALC, and show that, compared with BC-ALC, the average coordination time and the average backup time of a single checkpoint in WBC-ALC are reduced by 44.5% and 5.7% respectively.
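The barrier-only coordination described in this summary can be illustrated with a small sketch. This is not the authors' BC-ALC implementation and it does not use MPI: threads stand in for processes, a dictionary stands in for stable storage, and the names `run`, `store`, `every`, and `start_from` are hypothetical, chosen only for this example. It assumes every worker reaches the checkpoint location at the same step count.

```python
import threading

def run(num_workers, total_steps, every, store, start_from=None):
    """Run workers that checkpoint at barrier-aligned steps; return final results."""
    barrier = threading.Barrier(num_workers)
    final = [None] * num_workers

    def worker(rank):
        # Restore saved state if a checkpoint is supplied, otherwise start fresh.
        if start_from is not None:
            step, acc = start_from[rank]
        else:
            step, acc = 0, 0
        while step < total_steps:
            acc += (rank + 1) * step        # stand-in for real computation
            step += 1
            if step % every == 0:
                barrier.wait()              # all workers have reached this step
                store[rank] = (step, acc)   # each worker saves its own state
                barrier.wait()              # nobody continues until all have saved
        final[rank] = acc

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return final

# Uninterrupted run, checkpointing every 4 steps.
store = {}
full = run(3, 10, 4, store)

# Simulated restart from the last checkpoint (taken at step 8).
resumed = run(3, 10, 4, {}, start_from=dict(store))
print(full == resumed)
```

The two barriers bracket the save so that all saved states belong to the same step, which is what lets a restart resume every worker from a consistent point without logging any messages.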

“…Thus, client services do not notice that the service has upgraded, except that client services of the new type may see improved performance and fewer rejected requests, and client services of the old type may see decreased performance and more rejected requests. We adopted checkpointing technology [4][5] [10][11] and process migration [12] technology to save the states of service states and recover the states for the new version service.…”

Section: States Management (mentioning)

confidence: 99%

“…So the system updating procedure is an atom transaction [9]. The checkpointing technology [4][5] [10][11] and process migration [12] technology are adopted as state saving technology and state recovering.…”

Section: Updating Transactions Of Grid-based System (mentioning)

confidence: 99%

Research on Dynamic Updating of Grid Service

Huang, Wang 2007

Computational Science – ICCS 2007

Abstract. In complicated distributed systems based on a grid environment, grid services lack adequate support for runtime updating. In maintaining such systems, supporting transparent runtime updating of services is an urgent issue, especially when services communicate with each other frequently. Building on research into the implementation of grid services and the interaction between them following WSRF [3], this paper introduces a proxy service as the bridge for interaction between services, which makes runtime dynamic updating of grid services possible. Grid service updating must happen gradually, and there may be long periods of time when different nodes run different service versions and need to communicate using incompatible protocols. We present a methodology and infrastructure that make it possible to upgrade grid-based systems automatically while limiting service disruption.
