Commercial fault tolerance: a tale of two systems

Bartlett, W.; Spainhower, Lisa

doi:10.1109/tdsc.2004.4

Cited by 120 publications

(61 citation statements)

References 15 publications


“…DIVA [3] uses a simple in-order core as a checker for an out-of-order core. Triple redundancy systems are used in commercial processors (i.e., HP NonStop architecture [7]) and "Pair & spare" systems [5] and can achieve 0 DUE without roll-back. The work of [48] shows how to handle the DUE problem in L1 caches.…”

Section: Related Work (mentioning)

confidence: 99%

Avoiding core's DUE & SDC via acoustic wave detectors and tailored error containment and recovery

Upasani

Vera

González

2014

SIGARCH Comput. Archit. News
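
The statement above contrasts redundancy schemes that can mask an error (no detected-unrecoverable error, DUE, and no roll-back) with schemes that can only detect one. A minimal sketch of that distinction follows, assuming replicas are modelled as plain Python callables; the names `dmr_execute` and `tmr_execute` are hypothetical, and nothing here describes how HP NonStop or any other product is actually implemented.

```python
from collections import Counter
from typing import Callable, Sequence

Replica = Callable[[int], int]  # assumption: a replica is a pure function


def dmr_execute(replicas: Sequence[Replica], x: int) -> int:
    """Dual redundancy: compare two results.

    A mismatch is detected, but with only two results there is no way to
    tell which replica is wrong, so the error is detected but unrecoverable
    without roll-back or some other recovery step.
    """
    a, b = (replica(x) for replica in replicas)
    if a != b:
        raise RuntimeError("mismatch detected; faulty replica unknown")
    return a


def tmr_execute(replicas: Sequence[Replica], x: int) -> int:
    """Triple redundancy: majority vote over three results.

    A single faulty replica is outvoted, so execution continues without
    roll-back.
    """
    results = [replica(x) for replica in replicas]
    value, votes = Counter(results).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority among replicas")
    return value


if __name__ == "__main__":
    good = lambda x: x * 2        # fault-free replica
    faulty = lambda x: x * 2 + 1  # replica with an injected fault

    print(tmr_execute([good, faulty, good], 21))  # single fault is masked
    try:
        dmr_execute([good, faulty], 21)
    except RuntimeError as err:
        print("DMR:", err)                        # fault is only detected
```

With one faulty replica, the three-way vote still returns the correct value, while the two-way comparison can only report the mismatch.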



“…In a reconfigurable architecture, recovery entails isolating defective module(s) and incorporating spare structures as needed. Support for reconfiguration can be achieved at various granularities, from ultrafine grain systems [7,8] that have the ability to replace individual logic gates to coarser designs that focus on isolating entire processor cores [1,2,9-14,21]. This choice presents a trade-off between complexity of implementation and potential lifetime enhancement [15,16].…”

Section: Related Work (mentioning)

confidence: 99%

Hierarchical Multiplexing Interconnection Structure for Fault-Tolerant Reconfigurable Chip Multiprocessor

Kim

2011

JSTS:Journal of Semiconductor Technology and Science


Abstract: Stage-level reconfigurable chip multiprocessor (CMP) aims to achieve highly reliable and fault tolerant computing by using interwoven pipeline stages and on-chip interconnect for communicating with each other. The existing crossbar-switch based stage-level reconfigurable CMPs offer high reliability at the cost of significant area/power overheads. These overheads make realizing large CMPs prohibitive due to the area and power consumed by heavy interconnection networks. On the other hand, area/power-efficient architectures offer less reliability and inefficient stage-level resource utilization. In this paper, I propose a hierarchical multiplexing interconnection structure in lieu of crossbar interconnect to design area/power-efficient stage-level reconfigurable CMP. The proposed approach is able to keep the reliability offered by the crossbar-switch while reducing the area and power overheads. Experimental results show that the proposed approach reduces area by up to 21% and power by up to 32% when compared with the crossbar-switch based interconnection network.


“…For instance, lockstep [5], DIVA [2] and redundant multithreading either in a single SMT core [20] or in separate cores [15] are examples of coarse-grain concurrent testing. Most of those techniques do not replicate cache accesses [2,5,15,20], and thus, those errors not detected by parity or ECC are neither detected by those reexecution mechanisms. Only some implementations of lockstep [5] detect such errors, but the cost is huge in power (more than 2X), area (two cores are required to execute a single program) and performance.…”

Section: Related Work (mentioning)

confidence: 99%

“…Most of those techniques do not replicate cache accesses [2,5,15,20], and thus, those errors not detected by parity or ECC are neither detected by those reexecution mechanisms. Only some implementations of lockstep [5] detect such errors, but the cost is huge in power (more than 2X), area (two cores are required to execute a single program) and performance. Moreover, errors are not confined so further techniques are required to identify the faulty component.…”

Section: Related Work (mentioning)

confidence: 99%