DMT and DT2: Two Fault-Tolerant Architectures developed by CNES for COTs-based Spacecraft Supercomputers

Pignol, M.

doi:10.1109/iolts.2006.24

Cited by 30 publications

(11 citation statements)

References 17 publications


“…The implementation of synchronized lockstep combined with checkpoints and rollback recovery presented in this paper was inspired in the approaches proposed in [10] and [11], and it is an extension of the implementation presented in [13]. It has been conceived to harden processor cores embedded in FPGA devices against soft errors affecting the internal memory elements of the processors, and has been initially implemented using a Xilinx Virtex II Pro FPGA, which embeds two 32-bit IBM Power PC 405 hard processor cores.…”

Section: The Proposed Implementation (mentioning)

confidence: 99%

“…Other researchers explored alternative paths to hardware redundancy, which consisted basically in duplicating the system's processor and inserting special monitor modules that check whether the duplicated processors execute the same operations [10], [11]. These approaches are particularly appealing in those cases where processor duplication does not impact severely the hardware cost.…”

Section: Introduction (mentioning)

confidence: 99%


New Techniques for Improving the Performance of the Lockstep Architecture for SEEs Mitigation in FPGA Embedded Processors

Abate, Sterpone, Lisboa, et al. 2009

IEEE Trans. Nucl. Sci.


Abstract: The growing availability of embedded processors inside FPGAs provides unprecedented flexibility for system designers. The use of such devices for space or mission-critical applications, however, is being delayed by the lack of effective low-cost techniques to mitigate radiation-induced errors. In this paper a non-invasive approach for the implementation of fault-tolerant systems based on COTS processors embedded in FPGAs, using lockstep in conjunction with checkpoint and rollback recovery, is presented. The proposed approach does not require modifications in the processor architecture or in the application software. The experimental validation of this approach through fault injection is described, the corresponding results are discussed, and the addition of a write history table as a means to reduce the performance overhead imposed by previous implementations is proposed and evaluated.
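The combination of lockstep with checkpoint and rollback recovery described in this abstract can be illustrated with a minimal software sketch. It is not the implementation from the cited paper: the names `step` and `run_lockstep`, the toy state layout, and the fault model (a single bit flipped once in one replica) are assumptions made only for illustration, and the write history table is not modeled.

```python
import copy


def step(state):
    """One deterministic unit of work (hypothetical toy computation)."""
    state["acc"] = (state["acc"] * 31 + state["pc"]) % 65521
    state["pc"] += 1


def run_lockstep(n_steps, faults=None):
    """Run two replicas in lockstep and roll back when they disagree.

    `faults` maps a step index to the replica (0 or 1) whose state is
    corrupted once, right after that step executes (a transient fault).
    Returns the final accumulator value and the number of rollbacks.
    """
    faults = dict(faults or {})
    replicas = [{"pc": 0, "acc": 1}, {"pc": 0, "acc": 1}]
    checkpoint = copy.deepcopy(replicas[0])
    rollbacks = 0
    while replicas[0]["pc"] < n_steps:
        pc = replicas[0]["pc"]
        for replica in replicas:
            step(replica)
        if pc in faults:
            # Simulated soft error: flip one bit in one replica, once.
            replicas[faults.pop(pc)]["acc"] ^= 0x10
        if replicas[0] == replicas[1]:
            # States agree: save a new known-good checkpoint.
            checkpoint = copy.deepcopy(replicas[0])
        else:
            # States differ: restore both replicas from the checkpoint.
            replicas = [copy.deepcopy(checkpoint), copy.deepcopy(checkpoint)]
            rollbacks += 1
    return replicas[0]["acc"], rollbacks
```

In this sketch a mismatch costs one re-executed step because a checkpoint is taken after every agreeing step; a real system would trade checkpoint frequency against recovery time.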


“…The duplicated executions in [1] are generated by hardware support units in the processor, which are then compared in a separate unit to detect errors. Pignol proposed another approach using task-level redundancy for error detection and tolerance in DMT and DT2 architectures [12]. In this approach, error detection is achieved through re-execution of the computation tasks using a memory bridge, followed by comparisons of the results.…”

Section: Introduction (mentioning)

confidence: 99%
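The task-level scheme summarized in the statement above (execute a computation task again, then compare the results) can be sketched as follows. This is only an illustration under stated assumptions: `run_task_duplex`, the `inject` hook used to simulate a transient fault, and the retry policy are hypothetical, and the memory bridge mentioned in the statement is not modeled.

```python
def run_task_duplex(task, data, inject=None, max_retries=3):
    """Execute `task` twice on the same input and accept matching results.

    `inject` is an optional callable (attempt, copy_index, result) -> result
    used only to simulate a transient fault in one execution.
    Returns the agreed result and the number of retries that were needed.
    """
    for attempt in range(max_retries + 1):
        results = []
        for copy_index in (0, 1):
            # Each execution works on its own copy of the input data.
            out = task(list(data))
            if inject is not None:
                out = inject(attempt, copy_index, out)
            results.append(out)
        if results[0] == results[1]:
            return results[0], attempt
        # Mismatch: an error was detected, so the task pair is re-executed.
    raise RuntimeError("results still disagree after all retries")


def flip_once(attempt, copy_index, result):
    """Corrupt the second execution of the first attempt only."""
    if attempt == 0 and copy_index == 1:
        return result ^ 1
    return result
```

For example, `run_task_duplex(sum, [1, 2, 3, 4], inject=flip_once)` detects the disagreement on the first attempt and returns the correct sum after one retry.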

“…The above techniques require designers' direct intervention in hardware design [1], [12] or software compilers [10], [16] to incorporate desired error detection and tolerance capabilities. However, in many COTS based systems, this may not be feasible due to intellectual property (IP) rights and cost control.…”

Section: Introduction (mentioning)

confidence: 99%

Software Modification Aided Transient Error Tolerance for Embedded Systems

Shafik, Rauwerda, Potman, et al. 2013

2013 Euromicro Conference on Digital System Design


Abstract: Commercial off-the-shelf (COTS) components are increasingly being employed in embedded systems due to their high performance at low cost. With emerging reliability requirements, design of these components using traditional hardware redundancy incurs large overheads and time-demanding re-design and validation. To reduce the design time with shorter time-to-market requirements, software-only reliable design techniques can provide an effective and low-cost alternative. This paper presents a novel, architecture-independent software modification tool, SMART (Software Modification Aided transient eRror Tolerance), for effective error detection and tolerance. To detect transient errors in processor datapath, control flow and memory at reasonable system overheads, the tool incorporates selective and non-intrusive data duplication and dynamic signature comparison. Also, to mitigate the impact of the detected errors, it facilitates further software modification implementing software-based check-pointing. Due to automatic software-based source-to-source modification tailored to a given reliability requirement, the tool requires no re-design effort or hardware- or compiler-level intervention. We evaluate the effectiveness of the tool using a Xentium R processor based system as a case study of COTS-based systems. Using various benchmark applications with a single-event upset (SEU) based error model, we show that up to 91% of the errors can be detected or masked with reasonable performance, energy and memory footprint overheads.


“…Therefore, to achieve low performance overhead during normal operation, as well as fast recovery, the minimum transfer time for those operations must be obtained, together with a low implementation cost. As an example of task-level fault detection scheme, we can consider the approach presented in (Pignol, 2006). In the DT2 architecture (Pignol, 2006), two processors execute in parallel the same task, as in the lockstep architecture.…”

Section: Active Redundancy (mentioning)

confidence: 99%

Advanced Technologies for Transient Faults Detection and Compensation

Reorda, Sterpone, Violante

Design and Test Technology for Dependable Systems-on-Chip


Transient faults became an increasing issue in the past few years as smaller geometries of newer, highly miniaturized, silicon manufacturing technologies brought to the mass market failure mechanisms traditionally bound to niche markets such as electronic equipment for avionic, space or nuclear applications. This chapter presents the origin of transient faults, discusses the propagation mechanism, outlines models devised to represent them and finally discusses the state-of-the-art design techniques that can be used to detect and correct transient faults. The concepts of hardware, data and time redundancy are presented, and their implementations to cope with transient faults affecting storage elements, combinational logic and IP cores (e.g., processor cores) typically found in a System-on-Chip are discussed.

Single Event Transient (SET) is the non-destructive event that takes place when the parasitic current produces glitches on the values of nets in the circuit compatible with the noise margins of the technology, thus resulting in the temporary modification of the value of the nets from 0 to 1, or vice versa. Among SEEs, SEL is the most worrisome, as it corresponds to the destruction of the device, and hence it is normally solved by means of SEL-aware layout of silicon cells, or by current sensing and limiting circuits. SEUs, MBUs, and SETs can be tackled in different ways, depending on the market the application aims at. When vertical, high-budget applications are considered, for example electronic devices for telecom satellites, SEE-immune manufacturing technologies can be adopted, which are by construction immune to SEUs, MBUs, and SETs, but whose costs are prohibitive for any other market.

When budget-constrained applications are considered, from electronic devices for space exploration missions to automotive and commodity applications, SEUs, MBUs and SETs should be tackled by adopting fault detection and compensation techniques that allow developing dependable systems (i.e., where SEE effects produce negligible impacts on the application end user) on top of intrinsically not dependable technologies (i.e., which can be subject to SEUs, MBUs, and SETs), whose manufacturing costs are affordable. Different types of fault detection and compensation techniques have been developed in the past years, which are based on the well-known concepts of resource, information or time redundancy (Pradhan, 1996). In this chapter we first look at the source of soft errors, by presenting some background on radioactive environments, and then discussing how soft errors can be seen at the device level. We then present the most interesting mitigation techniques organized as a function of the component they target: processor, memory module, and random logic. Finally, we draw some conclusions.

BACKGROUND

The purpose of this section is to present an overview of the radioactive environments, to introduce the reader to the physical roots of soft errors. Afterwards, SEEs resulting from the interaction of ionizing radiation with th...
