An architecture for tolerating processor failures in shared-memory multiprocessors

Banâtre, Michel; Gefflaut, Alain; Joubert, P.; Morin, Christine; Lee, P.A.

doi:10.1109/12.543705

Cited by 27 publications

(21 citation statements)

References 28 publications


“…While a line is in backup state its data is considered invalid and will be used only if required for recovery. Hence, the cache will not be able to read from that line². Also, when a line enters in a backup state the lost data timeout will start and will stop once the backup state is abandoned.…”

Section: Avoiding Data Loss (mentioning)

confidence: 99%

“…Once C1 receives it, it transitions to a normal modified state. A cache line in a backup state will be used for recov- [footnote 2:] It is possible for a cache to receive valid data and a token before abandoning a backup state, only if the data message was not lost. In that case, it will be able to read from that line and the line will be transitioned to an intermediate backup state until the ownership acknowledgement is received.…”

Section: Avoiding Data Loss (mentioning)

confidence: 99%


A Low Overhead Fault Tolerant Coherence Protocol for CMP Architectures

Fernández-Pascual, Garcia, Acacio, et al. 2007

2007 IEEE 13th International Symposium on High Performance Computer Architecture


“…5. We only present the results for the static web server, but these results are qualitatively the same for all of our other workloads.…”

Section: Discussion (mentioning)

confidence: 90%

“…Logging due to transferring cache ownership, however, does not incur additional bandwidth, since the cache line must be read anyway. In Figure 6, for the static web server workload⁵, we plot this frequency as a function of the checkpoint interval. Both axes use log scales.…”

Section: Sensitivity Analyses (mentioning)

confidence: 99%

“…The Sequoia system [7] uses caches to hold state between checkpoints, and flushes dirty cache blocks to memory at every checkpoint. Banâtre et al [5] describe a Recoverable Shared Memory module that requires a shadow copy of the entire memory and a mechanism for maintaining the interprocessor dependence graph. checkpointing has also been used, but at radically different engineering costs.…”

Section: Related Work (mentioning)

confidence: 99%


SafetyNet: improving the availability of shared memory multiprocessors with global checkpoint/recovery

Sorin, Martin, Hill, et al.

Proceedings 29th Annual International Symposium on Computer Architecture


We develop an availability solution, called SafetyNet, that uses a unified, lightweight checkpoint/recovery mechanism to support multiple long-latency fault detection schemes. At an abstract level, SafetyNet logically maintains multiple, globally consistent checkpoints of the state of a shared memory multiprocessor (i.e., processors, memory, and coherence permissions), and it recovers to a pre-fault checkpoint of the system and re-executes if a fault is detected. SafetyNet efficiently coordinates checkpoints across the system in logical time and uses "logically atomic" coherence transactions to free checkpoints of transient coherence state. SafetyNet minimizes performance overhead by pipelining checkpoint validation with subsequent parallel execution. We illustrate SafetyNet avoiding system crashes due to either dropped coherence messages or the loss of an interconnection network switch (and its buffered messages). Using full-system simulation of a 16-way multiprocessor running commercial workloads, we find that SafetyNet (a) adds statistically insignificant runtime overhead in the common case of fault-free execution, and (b) avoids a crash when tolerated faults occur.

Copyright 2002 IEEE. Reprinted from Proceedings of the 29th Annual International Symposium on Computer Architecture, 2002, May 2002.


Status and Trends in the Performance Assessment of Fault Tolerant Systems

Kontoleon

Handbook of Performability Engineering
