Alain Gefflaut scite author profile

This paper focuses on the problem of fault tolerance in shared-memory multiprocessors, and describes an architecture designed for transparently tolerating processor failures. The Recoverable Shared Memory (RSM) is the novel component of this architecture, providing a hardware-supported backward error recovery mechanism which minimizes the propagation of recovery when a processor fails. The RSM permits a shared-memory multiprocessor to be constructed using standard caches and cache coherence protocols, and does not require any changes to be made to application software. The performance of the recovery scheme supported by the RSM is evaluated and compared with other schemes that have been proposed for fault-tolerant shared-memory multiprocessors. The performance study has been conducted by simulation using address traces collected from real parallel applications.

Unified Link Layer API: A generic and open API to manage wireless media access

Sooriyabandara

Farnham

Efthymiou

et al. 2008

Computer Communications

A recoverable distributed shared memory integrating coherence and recoverability

Kermarrec

Cabillic

Gefflaut

et al.

Large-scale distributed systems are very attractive for the execution of parallel applications requiring a huge computing power. However, their high probability of site failure is unacceptable, especially for long time running applications. In this paper, we address this problem and propose a checkpointing mechanism relying on a recoverable distributed shared memory (DSM) in order to tolerate single node failures. Although most recoverable DSMs require specific hardware to store recovery data, our scheme uses standard memories to store both current and recovery data. Moreover, the management of recovery data is merged with the management of current data by extending the DSM's coherence protocol. This approach takes advantage of the data replication provided by a DSM in order to limit the amount of transferred pages during the checkpointing. The paper also presents an implementation and a preliminary performance evaluation of our recoverable DSM on a 56 nodes Intel Paragon.

An efficient and scalable approach for implementing fault-tolerant DSM architectures

Morin

Kermarrec

Banâtre

et al. 2000

IEEE Trans. Comput.

Distributed Shared Memory (DSM) architectures are attractive to execute high performance parallel applications. Made up of a large number of components, these architectures have however a high probability of failure. We propose a protocol to tolerate node failures in cache-based DSM architectures. The proposed solution is based on backward error recovery and consists of an extension to the existing coherence protocol to manage data used by processors for the computation and recovery data used for fault tolerance. This approach can be applied to both Cache Only Memory Architectures (COMA) and Shared Virtual Memory (SVM) systems. The implementation of the protocol in a COMA architecture has been evaluated by simulation. The protocol has also been implemented in an SVM system on a network of workstations. Both simulation results and measurements show that our solution is efficient and scalable.

Alain Gefflaut

The SawMill multiserver approach

An architecture for tolerating processor failures in shared-memory multiprocessors

Unified Link Layer API: A generic and open API to manage wireless media access

A recoverable distributed shared memory integrating coherence and recoverability

An efficient and scalable approach for implementing fault-tolerant DSM architectures
