A recoverable distributed shared memory integrating coherence and recoverability

Kermarrec, Anne-Marie; Cabillic, Gilbert; Gefflaut, Alain; Morin, Christine; Puaut, Isabelle

doi:10.1109/ftcs.1995.466970

Cited by 38 publications

(17 citation statements)

References 18 publications


“…Future work includes the study and experimentation of a larger set of memory hierarchy management strategies as well as a complete rollback implementation including the processes' private context. The scalability of the SVM part of HA-PSLS has been shown in [23]; we are currently extending our prototype to evaluate the scalability of HA-PSLS as well as the impact of injecting realistic faults. The studied protocols are currently being implemented in the Gobelins cluster single system image operating system [24], which runs on a cluster based on standard networking technologies (Fast Ethernet, Gigabit Ethernet, Myrinet).…”

Section: Results (mentioning)

confidence: 98%

HA‐PSLS: a highly available parallel single‐level store system

Kermarrec

Morin

2003

Concurrency and Computation


SUMMARY: Parallel single-level store (PSLS) systems integrate a shared virtual memory and a parallel file system. They provide programmers with a global address space including both memory and file data. PSLS systems implemented in a cluster thus represent a natural support for long-running parallel applications, combining both the natural shared memory programming model and a large and efficient file system. However, the need to tolerate failures in such a system increases with the size of applications. In this paper we present a highly-available parallel single level store system (HA-PSLS), which smoothly integrates a backward error recovery high-availability mechanism into a PSLS system. Our system is able to tolerate multiple transient failures, a single permanent failure, and power cut failures affecting the whole cluster, without requiring any specialized hardware. For this purpose, HA-PSLS relies on a high degree of integration (and reusability) of high-availability and standard features. A prototype integrating our high-availability support has been implemented and we show some performance results in the paper. Copyright

“…Conversely, backup replicas created for fault tolerance can be used by the consistency protocol. This approach has a major disadvantage: the design of the corresponding software layer is very complex, as illustrated by some fault-tolerant DSM systems [16,17].…”

Section: Introduction (mentioning)

confidence: 99%

How to bring together fault tolerance and data consistency to enable Grid data sharing

Antoniu

Deverge

2006

Concurrency and Computation


This paper addresses the challenge of transparent data sharing within computing Grids built as cluster federations. On such platforms, the availability of storage resources may change in a dynamic way, often due to hardware failures. We focus on the problem of handling the consistency of replicated data in the presence of failures. We propose a software architecture which decouples consistency management from fault tolerance management. We illustrate this architecture with a case study showing how to design a consistency protocol using fault‐tolerant building blocks. As a proof of concept, we describe a prototype implementation of this protocol within JUXMEM, a software experimental platform for Grid data sharing, and we report on a preliminary experimental evaluation of the proposed approach. Copyright © 2006 John Wiley & Sons, Ltd.


“…More relevant for our work is the survey of recoverable distributed shared virtual memory systems presented in [21]. Previous work that has examined various aspects of recovery in software shared memory systems includes [27,10,31,17,1,18,26]. In all these cases, the focus has been on protocol extensions for logging and checkpointing that enable coarse-grain system recovery.…”

Section: Related Work (mentioning)

confidence: 99%

Fast and transparent recovery for continuous availability of cluster-based servers

Christodoulopoulou

Manassiev

Bilas

et al. 2006

Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming


Recently there has been renewed interest in building reliable servers that support continuous application operation. Besides maintaining system state consistent after a failure, one of the main challenges in achieving continuous operation is to provide fast reconfiguration. The complexity of the failure reconfiguration mechanisms employed and their overheads depend on the type of platform that is being used as a server and the types of applications that need to be supported. In this paper we focus on providing support for shared-memory applications running on clusters of commodity nodes and interconnects. Achieving continuous operation for shared memory applications on clusters presents two main challenges. (a) The fault tolerance mechanisms employed should be transparent to applications and should have low overhead during failure-free execution. (b) When failures occur, reconfiguration should occur with minimum application disruption without requiring the full recovery of the failed node. In this work we examine in detail the latter, i.e., (b), the failure reconfiguration path. We use a previously developed system [8] that achieves (a) by using dynamic replication of data to the memories of multiple nodes of the system during execution. We examine in detail how the runtime system can achieve minimum application interruption, when failures occur. We present the design and implementation of FineFRC (Fine-grained Failure Reconfiguration on Clusters), a runtime system for achieving continuous operation of shared memory applications on commodity clusters without requiring application instrumentation or human intervention. We present results using a working, 16-processor system that achieves subsecond failure reconfiguration times.
