Reli: Hardware/software Checkpoint and Recovery scheme for embedded processors

Li, Qing; Ragel; Parameswaran

doi:10.1109/date.2012.6176621

Cited by 7 publications

(4 citation statements)

References 30 publications


“…Prior works [35,45] have also explored symptom-based soft-error detection/recovery mechanisms, but they provide low soft-error coverage, since they rely on coarse-grain detectors, such as fatal-traps, hangs, panics, and so on. Under hardware-based resilience schemes [35,39,45], the solutions enable redundancy mechanisms, such as TLR [4,11,17,27] or nMR [41,44] to provide soft-error protection. For instance, prior work [44] focuses on applying DMR on a multicore (GPU) setting, where it redundantly executes two copies of the same application, and delivers high soft-error coverage by performing cross checks in a duplicated thread.…”

Section: Related Work (mentioning)

confidence: 99%

Prism

Omar

Khan²

2021

ACM Trans. Archit. Code Optim.


Multicores increasingly deploy safety-critical parallel applications that demand resiliency against soft-errors to satisfy the safety standards. However, protection against these errors is challenging due to complex communication and data access protocols that aggressively share on-chip hardware resources. Research has explored various temporal and spatial redundancy-based resiliency schemes that provide multicores with high soft-error coverage. However, redundant execution incurs performance overheads due to interference effects induced by aggressive resource sharing. Moreover, these schemes require intrusive hardware modifications and fall short in providing efficient system availability guarantees. This article proposes PRISM, a resilient multicore architecture that incorporates strong hardware isolation to form redundant clusters of cores, ensuring a non-interference-based redundant execution environment. A soft error in one cluster does not affect the execution of the other cluster, resulting in high system availability. Implementing strong isolation for shared hardware resources, such as queues, caches, and networks, requires logic for partitioning. However, it is less intrusive as complex hardware modifications to protocols, such as hardware cache coherence, are avoided. The PRISM approach is prototyped on a real Tilera Tile-Gx72 processor that enables primitives to implement the proposed cluster-level hardware resource isolation. The evaluation shows performance benefits from avoiding destructive hardware interference effects with redundant execution, while delivering superior system availability.


“…Authors in [7] reconfigure the redundancy of functional units of a DSP processor into an m-way replication; however, the execution time of a program will be duplicated due to assigning some functional units to fault-tolerance. Recently, a hardware/software CR-based scheme, called Reli, has been proposed in [8], which is based on elaborating microinstructions with additional micro-operations to facilitate check-pointing.…”

Section: Related Work (mentioning)

confidence: 99%

“…The main novel feature of the presented recovery method is isolation of the faulty functional unit from the fault-free ones for one clock-cycle, referred to as freezing, and re-executing the faulty part of the instruction. Another novel feature is the minimum amount of information needed to be stored in each functional unit; this decreases the recovery overhead to only one clock-cycle, while a typical recovery mechanism takes 16 clock-cycles for the CR-based mechanism [8]. Moreover, the speed of the enriched processor is identical to the performance of the original processor, as long as no SET is present in the system.…”

Section: Recovery Mechanism In Combinational Logics (mentioning)

confidence: 99%

Two soft-error mitigation techniques for functional units of DSP processors

Rohani

Kerkhoff

2014

2014 19th IEEE European Test Symposium (ETS)


This paper presents two soft-error mitigation methods for DSP processors. Considering that a DSP processor is composed of several functional units and each functional unit constitutes of a control unit, some registers and combinational logic, a unique characteristic of DSP workloads has been deployed to develop a masking mechanism for the control-logic of each functional unit. Combinational logic has been elaborated with a fast recovery mechanism to isolate the fault-free functional units and re-execute the erroneous instruction. These techniques have been implemented on a DSP processor in order to assess the achieved fault-tolerance versus the imposed overheads.


“…A hardware/software approach for detecting and recovering from errors is proposed in [317]. The fundamental idea of this approach is to re-engineer the instruction set.…”

Section: Hybrid Approach (mentioning)

confidence: 99%

Soft error mitigation techniques for future chip multiprocessors

Upasani


The sustained drive to downsize the transistors has reached a point where device sensitivity against transient faults due to neutron and alpha particle strikes, a.k.a. soft errors, has moved to the forefront of concerns for next-generation designs. Following Moore's law, the exponential growth in the number of transistors per chip has brought tremendous progress in the performance and functionality of processors. However, incorporating billions of transistors into a chip makes it more likely to encounter soft errors. Moreover, aggressive voltage scaling and process variations make the processors even more vulnerable to soft errors. Also, the number of cores on chip is growing exponentially, fueling the multicore revolution. With increased core counts and larger memory arrays, the total failure-in-time (FIT) per chip (or package) increases. Our studies concluded that the shrinking technology required to match the power and performance demands for servers and future exa- and tera-scale systems impacts the FIT budget. New soft error mitigation techniques that allow meeting the failure rate target are important to keep harnessing the benefits of Moore's law. Traditionally, reliability research has focused on providing circuit, microarchitecture and architectural solutions, which include device hardening, redundant execution, lock-step, error correcting codes, modular redundancy, etc. In general, all these techniques are very effective in handling soft errors but expensive in terms of performance, power, and area overheads. Traditional solutions fail to scale in providing the required degree of reliability with increasing failure rates while maintaining low area, power and performance cost. Moreover, this family of solutions has hit the point of diminishing return, and simply achieving a 2X improvement in the soft error rate may be impractical.
Instead of relying on some kind of redundancy, a new direction that is attracting growing interest in the research community is detecting the actual particle strike rather than its consequence. The proposed idea consists of deploying a set of detectors on silicon that would be in charge of perceiving the particle strikes that can potentially create a soft error. Upon detection, a hardware or software mechanism would trigger the appropriate recovery action. This work proposes a lightweight and scalable soft error mitigation solution. As a part of our soft error mitigation technique, we show how to use acoustic wave detectors for detecting and locating particle strikes. We use them to protect both the logic and the memory arrays, acting as a unified error detection mechanism. We architect an error containment mechanism and a unique recovery mechanism based on checkpointing that works with acoustic wave detectors to effectively recover from soft errors. Our results show that the proposed mechanism protects the whole processor (logic, flip-flop, latches and memory arrays) incurring minimum overheads.
Nanotechnology has continued to advance over recent decades at the pace set by Moore's law, which states that transistors shrink in size by 50% every two years. This reduction in size has allowed transistors to become ever faster and to consume less energy. However, this technological progress now faces the problem of the vulnerability of these small transistors, above all to particle strikes (soft errors). Moreover, the way these transistors are used today makes them even more vulnerable to possible particle strikes. The reduction of the voltage used in current processors, the increase in the number of processors in current devices, and variations in the manufacturing process all make it more likely that particles striking the transistors cause errors.
Our studies conclude that the technology needed to build future terascale and exascale supercomputers will be highly susceptible to particle strikes, and that new techniques for detecting and correcting the errors they cause will be essential. The solutions in use today, based on modifying circuits and processor design, will not be usable in future terascale and exascale supercomputers at a reasonable cost. A new class of solution under investigation is to detect the particle strikes themselves, an approach entirely opposite to earlier directions, which were based on detecting the errors that the strikes caused. Our solution consists of placing a set of detectors on the silicon that would detect all particle strikes that could potentially cause errors. Once a strike is detected, we would, if necessary, apply solutions to recover from the error it may have caused. In our work we focus on acoustic sensors. The thesis proposes mechanisms that allow us to detect and locate particle strikes based on these acoustic sensors. We show how they can be used to protect processors, both logic and memory. We also propose a solution that allows us to contain and recover from the errors that particle strikes cause once they are detected by our sensors. The results show that the cost of protecting future terascale and exascale supercomputers is reasonable and sufficient.

