2004
DOI: 10.1109/tdsc.2004.4

Commercial fault tolerance: a tale of two systems

citations
Cited by 120 publications
(61 citation statements)
references
References 15 publications
“…DIVA [3] uses a simple in-order core as a checker for an out-of-order core. Triple redundancy systems are used in commercial processors (i.e., HP NonStop architecture [7]) and "Pair & spare" systems [5] and can achieve 0 DUE without roll-back. The work of [48] shows how to handle the DUE problem in L1 caches.…”
Section: Related Work (mentioning)
confidence: 99%
“…In a reconfigurable architecture, recovery entails isolating defective module(s) and incorporating spare structures as needed. Support for reconfiguration can be achieved at various granularities, from ultrafine grain systems [7,8] that have the ability to replace individual logic gates to coarser designs that focus on isolating entire processor cores [1,2,9-14,21]. This choice presents a trade-off between complexity of implementation and potential lifetime enhancement [15,16].…”
Section: Related Work (mentioning)
confidence: 99%
“…For instance, lockstep [5], DIVA [2] and redundant multithreading either in a single SMT core [20] or in separate cores [15] are examples of coarse-grain concurrent testing. Most of those techniques do not replicate cache accesses [2,5,15,20], and thus, those errors not detected by parity or ECC are neither detected by those reexecution mechanisms. Only some implementations of lockstep [5] detect such errors, but the cost is huge in power (more than 2X), area (two cores are required to execute a single program) and performance.…”
Section: Related Work (mentioning)
confidence: 99%
“…Most of those techniques do not replicate cache accesses [2,5,15,20], and thus, those errors not detected by parity or ECC are neither detected by those reexecution mechanisms. Only some implementations of lockstep [5] detect such errors, but the cost is huge in power (more than 2X), area (two cores are required to execute a single program) and performance. Moreover, errors are not confined so further techniques are required to identify the faulty component.…”
Section: Related Work (mentioning)
confidence: 99%