2008
DOI: 10.1109/ipdps.2008.4536435

Efficient software checking for fault tolerance

Abstract: As semiconductor technology scales into the deep submicron regime, the occurrence of transient or soft errors will increase. This will require new approaches to error detection. Software checking approaches are attractive because they require little hardware modification and can be easily adjusted to fit different reliability and performance requirements. Unfortunately, software checking adds a significant performance overhead. In order to make software checking systems more attractive, this dissertation proposes…

Cited by 4 publications (2 citation statements)
References 47 publications (92 reference statements)
“…The rollback is the ability to return to a previously valid state of the processor in the case, for instance, of an execution error or a power failure. We assume that an error detection mechanism is available in the processor architecture to identify errors during execution as proposed, for instance, in Yu et al [2008] or Wali et al [2016]. The principle of the rollback is shown in Figure 7.…”
Section: Rollback (mentioning)
confidence: 99%
“…For example, the BlueGene/L experiences one soft error in its L1 cache every 4-6 hours [1]. All these factors make that the massive parallel CFD applications are more vulnerable to the failure attack [2], [3]. Checkpoint/Restart technology is a widely used fault tolerant (FT) method, which periodically backups the intermediate result to the stable storage, and rollbacks to the nearest checkpoint when a failure occurs.…”
Section: Introduction (mentioning)
confidence: 99%