Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers
DOI: 10.1109/ftcs.1995.466970

A recoverable distributed shared memory integrating coherence and recoverability

Abstract: Large-scale distributed systems are very attractive for the execution of parallel applications requiring a huge computing power. However, their high probability of site failure is unacceptable, especially for long time running applications. In this paper, we address this problem and propose a checkpointing mechanism relying on a recoverable distributed shared memory (DSM) in order to tolerate single node failures. Although most recoverable DSMs require specific hardware to store recovery data, our sc…

Cited by 38 publications (17 citation statements)
References 18 publications
“…Future work includes the study and experimentation of a larger set of memory hierarchy management strategies as well as a complete rollback implementation including the processes' private context. The scalability of the SVM part of HA-PSLS has been shown in [23]; we are currently extending our prototype to evaluate the scalability of HA-PSLS as well as the impact of injecting realistic faults. The studied protocols are currently being implemented in the Gobelins cluster single system image operating system [24], which runs on a cluster based on standard networking technologies (Fast Ethernet, Gigabit Ethernet, Myrinet).…”
Section: Results (mentioning)
confidence: 98%
“…Conversely, backup replicas created for fault tolerance can be used by the consistency protocol. This approach has a major disadvantage: the design of the corresponding software layer is very complex, as illustrated by some fault-tolerant DSM systems [16,17].…”
Section: Introduction (mentioning)
confidence: 99%
“…More relevant for our work is the survey of recoverable distributed shared virtual memory systems presented in [21]. Previous work that has examined various aspects of recovery in software shared memory systems includes [27,10,31,17,1,18,26]. In all these cases, the focus has been on protocol extensions for logging and checkpointing that enable coarse-grain system recovery.…”
Section: Related Work (mentioning)
confidence: 99%