Abstract. As the run time of an application approaches the mean time to interrupt (MTTI) of the system on which it is running, it becomes necessary to generate intermediate snapshots of the application's run state, known as checkpoint files or restart dumps. In the event of a system failure that halts program execution, these snapshots allow an application to resume computing from the most recently saved intermediate state instead of starting over at the beginning of the calculation. In this paper three models for predicting the optimum compute intervals between restart dumps are discussed. These models are evaluated by comparing their results to a simulation that emulates an application running on an actual system with interrupts. The results are used to derive a simple method for calculating the optimum restart interval.

This paper examines methods of approximating the optimum checkpoint restart strategy for minimizing application run time on a system exhibiting Poisson single-component failures. Two different models will be developed and compared. We begin with a simplified cost function that yields a first-order model. We then derive a more complete cost function and demonstrate a perturbation solution that provides accurate higher-order approximations to the optimum checkpoint interval.
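To make the setting concrete, the following is a minimal sketch of the kind of simulation the abstract describes: an application accumulating a fixed amount of compute work, checkpointing every tau seconds, and losing uncheckpointed progress whenever a Poisson-process failure strikes. The parameter values (work, delta, restart, mtti) are illustrative assumptions, not figures from this paper, and failures during restart are ignored for simplicity; the comparison point sqrt(2*delta*MTTI) is the classic first-order estimate of the optimum interval that the first-order model developed here corresponds to.

```python
import math
import random

def simulate_runtime(work, tau, delta, restart, mtti, rng):
    """Wall-clock time to complete `work` seconds of computation,
    checkpointing every `tau` compute seconds (dump cost `delta`),
    with exponentially distributed failures of mean `mtti`."""
    wall = 0.0
    done = 0.0                                  # work safely checkpointed
    next_fail = rng.expovariate(1.0 / mtti)     # absolute time of next failure
    while done < work:
        # One segment: compute up to tau seconds, then write a dump.
        # (Simplification: the final partial segment also pays the dump cost.)
        step = min(tau, work - done)
        if wall + step + delta <= next_fail:
            wall += step + delta                # segment and dump completed
            done += step
        else:
            # Failure mid-segment: progress since the last dump is lost,
            # and we pay the restart cost (no failures during restart).
            wall = next_fail + restart
            next_fail = wall + rng.expovariate(1.0 / mtti)
    return wall

if __name__ == "__main__":
    rng = random.Random(1)
    work = 7 * 24 * 3600.0      # one week of compute (illustrative)
    delta, restart = 300.0, 600.0
    mtti = 12 * 3600.0
    tau_opt = math.sqrt(2 * delta * mtti)       # first-order estimate
    for tau in (tau_opt / 2, tau_opt, 2 * tau_opt):
        runs = [simulate_runtime(work, tau, delta, restart, mtti, rng)
                for _ in range(200)]
        mean_h = sum(runs) / len(runs) / 3600
        print(f"tau = {tau / 3600:5.2f} h   mean wall time = {mean_h:7.1f} h")
```

Averaged over many trials, the intervals near the first-order estimate should yield the lowest mean wall time, with both shorter intervals (excess dump overhead) and longer intervals (more lost work per failure) performing worse.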