A Reliable Routing Architecture and Algorithm for NoCs

DeOrio, Andrew; Fick, David; Bertacco, Valeria; Sylvester, Dennis; Blaauw, David; Hu, Jin; Chen, G.

doi:10.1109/tcad.2011.2181509

Cited by 74 publications

(17 citation statements)

References 60 publications


“…The hardware of the routers in the system is checked as frequently as the cores using tests that require 150 000 cycles [51]. Router's lists of local connections is kept updated much more frequently, every 10 000 cycles, to ensure correct communication between directly connected nodes.…”

Section: G Full System Performance Analysis (mentioning)

confidence: 99%


Cardio: CMP Adaptation for Reliability Through Dynamic Introspective Operation

Pellegrini

Bertacco

2014

IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst.

Self Cite


Abstract: A modern digital system includes in a single chip many components: processing cores, large caches, memory controllers, and hardware accelerators. Looking forward, future semiconductor technologies will enable even higher device integration, overall increasing system performance while reducing energy consumption. Unfortunately, prominent experts agree that such technologies will be prone to both permanent and transient faults within their lifetime. With the goal of addressing this issue, we propose Cardio: a low-cost architecture for reliable chip multiprocessors. Our solution is based on a novel hardware/software co-design where silicon failures are detected in hardware and system reconfiguration is managed in software. Comparing Cardio with a state-of-the-art hardware-based resiliency solution, Immunet, we found that our design can achieve a comparable fault response time while requiring a much lower area overhead. The proposed solution relies on a distributed resource manager to collect information about a CMP component's health, and leverages a synchronized distributed control mechanism to recover from permanent failures. Such an architecture can operate as long as at least one general-purpose processor is still functional. Our experimental evaluation indicates that the overall performance impact of Cardio is as low as 4.5%, and its dynamic reconfiguration time upon fault detection is between 20 and 50 thousand cycles.



“…Storage requirements grow linearly with system size, and thus Cardio benefits are even more marked for larger CMPs. When we consider both reconfigurable routing tables and router self-test logic, interconnect area increases by approximately 11.4% [51].…”

Section: H Area Overhead (mentioning)

confidence: 99%

Cardio: CMP Adaptation for Reliability Through Dynamic Introspective Operation

Pellegrini

Bertacco

2014

IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst.

Self Cite


“…Upon a fault affecting a link (or a portion of a router impacting link operability) the disabled turns must be recomputed to allow packets to go through alternative surviving routes. This effort entails a global routing reconfiguration [4], and it does not guarantee deadlock-freedom. Up*/down* routing: Spanning tree-based routing algorithms, such as up*/down* routing [17], can be applied to route packets in any topology.…”

Section: A Alternative Route Generation Cost Analysis (mentioning)

confidence: 99%

“…Most of them can be grouped into two families based on their approach to reconfiguration. The first family deploys routing tables and logic that are updated upon each fault occurrence [1,4,15,22]. This approach is topology-agnostic and, in the best case, it can tolerate an arbitrary number of faults, but suffers from high reconfiguration overhead.…”

Section: Introduction (mentioning)

confidence: 99%


Brisk and limited-impact NoC routing reconfiguration

Lee

Parikh

Bertacco

2014

Design, Automation & Test in Europe Conference & Exhibition (DATE), 2014

Self Cite


Abstract: The expected low reliability of the silicon substrate at upcoming technology nodes presents a key challenge for digital system designers. Networks-on-chip (NoCs) are especially concerning because they are often the only communication infrastructure for the chips in which they are deployed. Recently, routing reconfiguration solutions have been proposed to address this problem. However, they come at a high silicon cost, and often require suspending the normal network activity while executing a centralized, resource-hungry reconfiguration algorithm. This paper proposes a novel, fast and minimalistic routing reconfiguration algorithm, called BLINC. BLINC utilizes precomputed routing metadata to quickly evaluate localized detours upon each fault manifestation. We showcase the efficacy of our algorithm by deploying it in a novel NoC fault detection and reconfiguration solution, where BLINC enables uninterrupted NoC operation during aggressive online testing. If a fault seems likely to occur, we circumvent it in advance with the aid of our BLINC reconfiguration algorithm. Experimental results show an 80% reduction in the average number of routers affected by a reconfiguration event, compared to state-of-the-art techniques. BLINC enables negligible performance degradation in our detection and reconfiguration solution, while solutions based on current techniques suffer a 17-fold latency increase.
