Large-scale distributed systems are very attractive for the execution of parallel applications requiring huge computing power. However, their high probability of site failure is unacceptable, especially for long-running applications. In this paper, we address this problem and propose a checkpointing mechanism relying on a recoverable distributed shared memory (DSM) in order to tolerate single-node failures. Although most recoverable DSMs require specific hardware to store recovery data, our scheme uses standard memories to store both current and recovery data. Moreover, the management of recovery data is merged with the management of current data by extending the DSM's coherence protocol. This approach takes advantage of the data replication provided by a DSM in order to limit the number of pages transferred during checkpointing. The paper also presents an implementation and a preliminary performance evaluation of our recoverable DSM on a 56-node Intel Paragon.
Abstract. Mobile computing is going to change the way computers are used today. However, the mobile computing environment is characterized by high mobility, frequent disconnections, and a lack of resources such as memory and battery power. These features make applications running on mobile devices more susceptible to faults. Checkpointing is a major technique to confine faults and restart applications faster. In this paper, we present a coordinated checkpointing algorithm for deterministic applications. We use anti-messages along with selective logging to achieve faster recovery and reduced energy consumption. Our algorithm is non-blocking and avoids unnecessary computation. We ask only a minimum number of processes to take a checkpoint, and we also take into account the limited storage available on mobile devices.