A Survey on Design Approaches to Circumvent Permanent Faults in Networks-on-Chip

Werner, Sebastian; Navaridas, Javier; Luján, Mikel

doi:10.1145/2886781

Cited by 35 publications

(21 citation statements)

References 119 publications


“…Moreover, approaches focus on different types of random hardware faults: transient and intermittent faults [16]-[19]; permanent faults [24]; or both [20]-[23]. Comprehensive overviews are found in [25] and [26]. The key technique varies with the approach: from retransmission protocols and adaptive routing to stochastic broadcasts.…”

Section: Related Work (mentioning)

confidence: 99%

Providing Integrity in Real-Time Networks-on-Chip

Rambo, Shang, Ernst. 2019. IEEE Trans. VLSI Syst.

Mixed-critical real-time systems must meet strict integrity, resilience and timing constraints, as specified by safety standards. Due to the increasing threat of random hardware faults, efficiently achieving high reliability and dependability calls for cross-layer fault-tolerance solutions. This work introduces the Advanced Integrity Q-service (AIQ), a mechanism to ensure the integrity and predictability of on-chip communication under random hardware faults. Devised for cross-layer and hierarchical fault-tolerance solutions, AIQ realizes low-overhead error detection in hardware and delegates error handling to arbitrary strategies in software. Experimental evaluation featuring benchmark applications and an industrial avionics use case shows that AIQ operates with high reliability and availability and low hardware and performance overheads. In a many-core mixed-critical platform under expected real-time scenarios, AIQ performs with execution time overhead between 1.4% and 7.1%.



“…In more detail, a chip exhibits different error rates in different periods of its lifetime. Without loss of generality, this variability has been modelled as a "bathtub" curve [39], as shown in Figure 3. In its infant period, there is a very high but decreasing failure rate, until a plateau of minimum, constant failure rate is reached at its grace period.…”

Section: Target Application Model (mentioning)

confidence: 99%

“…The distributed nature of the targeted systems on both Processing Elements (PEs) and Resource Management imposes extra design requirements and increased complexity to provide fault tolerance guarantees in an online and timely manner. Therefore, it has been identified that in order to effectively mitigate variability issues in kilo-core SoCs, it is mandatory to intervene and leverage techniques for increased dependability in all layers of system design ranging from hardware [39] to high level application development [18,31].…”

Section: Introduction (mentioning)

confidence: 99%

SoftRM

Tsoutsouras, Masouros, Xydis, et al. 2017. ACM Trans. Embed. Comput. Syst.

Many-core systems are envisioned to leverage the ever-increasing demand for more powerful computing systems. To provide the necessary computing power, the number of Processing Elements integrated on-chip increases and NoC-based infrastructures are adopted to address the interconnection scalability. The advent of these new architectures surfaces the need for more sophisticated, distributed resource management paradigms, which, in addition to the extreme integration scaling, make the new systems more prone to errors manifested in both hardware and software. In this work, we highlight the need for Run-Time Resource management to be enhanced with fault tolerance features and propose SoftRM, a resource management framework which can dynamically adapt to permanent failures in a self-organized, workload-aware manner. Self-organization allows the resource management agents to recover from a failure in a coordinated way by electing a new agent to replace the failed one, while workload awareness optimizes this choice according to the status of each core. We evaluate the proposed framework on Intel Single-chip Cloud Computer (SCC), a NoC-based many-core system, and customize it to achieve minimum interference on the resource allocation process. We showcase that its workload-aware features manage to utilize free resources in more than 90% of the conducted experiments. Comparison with relevant state-of-the-art fault tolerant frameworks shows a decrease of up to 67% in the imposed overhead on application execution. CCS Concepts: • General and reference → Cross-computing tools and techniques; • Computer systems organization → Multicore architectures; • Networks → Network on chip; • Computing methodologies → Self-organization;


“…This issue also affects the links and routers of the NoC, which require specific attention in order to maximize yield and to ensure correct operation. This emphasizes the significance of robust design solutions and has led to fault tolerance becoming a fundamental design constraint [3]. In this context, many fault tolerance techniques have been proposed at several levels (circuit/system and hardware/software) for critical applications.…”

Section: Introduction (mentioning)

confidence: 99%

Collaborative Routing Algorithm for Fault Tolerance in Network on Chip CRAFT NoC

Nehnouh, Senouci, Chaib. 2017. IJACSA

Many fault tolerance techniques have been proposed in Network on Chip to cope with defects during fabrication or faults during product lifetime. Fault-tolerant routing algorithms provide reliable mechanisms to continue delivering their services in spite of nodes made defective by permanent and/or transient faults throughout their lifetime. This paper presents a new approach in the domain of fault-tolerant NoC with two main contributions. Firstly, we consider a unified fault model that includes transient faults, permanent faults and congestion considered as a fault. Secondly, we present a new architecture based on sub-nets and give an overview of the associated test and (re)routing algorithm. The main result of this paper is a new routing algorithm called Collaborative Routing Algorithm for Fault Tolerance in Network on Chip (CRAFT-NoC). We compare our approach with ACO-FAR, which also considers congestion and permanent faults. Our simulation results show significant improvements in terms of both latency and reliability.

