Network-on-chip (NoC) has become the mainstream fabric architecture for chip multiprocessor (CMP) design. Owing to the market-driven advancement of modern CMP applications, multicast traffic is increasing rapidly to support barrier synchronization, multithreading, and cache coherence protocols. Although multicast by branching packets in the NoC router facilitates shortest-path routing, the additional branching-induced deadlocks must be circumvented. Existing NoC studies on deadlock-free minimal-path routing for multicast traffic have typically deployed additional virtual channels or buffers large enough to hold entire packets, thereby significantly increasing the router area. Focusing on an area-efficient solution that sustains performance, we propose a novel multicast router using buffer sharing (MRBS) that guarantees deadlock-free multicast routing by exploiting the spatial diversity of the input buffer. MRBS ensures minimal-path routing without requiring additional virtual channels or large buffers to hold entire packets. Extensive experiments were conducted by varying the buffer, packet, and network sizes, as well as the number of destinations per packet, under random multicast traffic with diverse injection rates. Simulation results show that MRBS achieves a 39.3% improvement in the area-delay product on average for various network sizes compared to the conventional tree-based router. INDEX TERMS Area-efficient design, buffer sharing, deadlock recovery, multicast communication, network-on-chip, router architecture, tree-based routing
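The branching operation that tree-based multicast relies on can be illustrated with a minimal sketch. It assumes a 2D mesh with XY dimension-order routing, which is an illustrative choice and not necessarily the routing function used by MRBS; the names `xy_output_port` and `branch` are hypothetical. A router groups the destinations of a packet by the output port each one needs, and the packet branches whenever more than one group results.

```python
from collections import defaultdict

def xy_output_port(cur, dst):
    """Output port chosen by XY dimension-order routing (assumed here)."""
    (cx, cy), (dx, dy) = cur, dst
    if dx > cx:
        return "E"
    if dx < cx:
        return "W"
    if dy > cy:
        return "N"
    if dy < cy:
        return "S"
    return "LOCAL"  # the packet has reached this destination

def branch(cur, dests):
    """Group a multicast packet's destinations by required output port.

    More than one key in the result means the router must replicate
    (branch) the packet toward several ports.
    """
    groups = defaultdict(list)
    for d in dests:
        groups[xy_output_port(cur, d)].append(d)
    return dict(groups)

if __name__ == "__main__":
    copies = branch((1, 1), [(3, 1), (0, 2), (1, 3), (1, 1)])
    for port, dsts in copies.items():
        print(port, dsts)
```

Each branch holds an output port until its copy of the packet drains, which is why branching across several routers can create the cyclic waits the abstract refers to.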
The relentless proliferation of big data and artificial intelligence has compelled computing platform architectures to evolve into heterogeneous multicores for greater energy efficiency. A customized network-on-chip (NoC) supporting interconnection diversity is pivotal for the asymmetric data-access traffic requirements of modern heterogeneous multicore systems-on-chip (SoCs). A significant portion of on-chip data access comprises single-source multi-destination (SSMD) traffic, which supports barrier synchronization, multithreading, cache coherence protocols, and deep neural network (DNN) acceleration. Because it amortizes SSMD traffic, multicast routing is essential for effectively utilizing communication bandwidth. One of the primary concerns in supporting multicast routing in NoCs is circumventing the additional deadlock conditions caused by branch operations among the active routers. However, it is challenging to implement throughput-optimized multicast routing in NoCs with irregular topologies because the deadlock conditions become highly complicated, and the Hamiltonian path required to apply the labeling rule may not exist. Two important observations were made regarding multicast routing in customized NoCs: 1) Even if the NoC lacks a Hamiltonian path, deadlock freedom can be guaranteed by restricting branch operations to a specific destination. 2) The variable path diversity in a custom topology can be leveraged in routing path allocation and branching. Based on these properties, this study proposes a deadlock-free and throughput-enhanced multicast routing for customized NoC (MRCN). MRCN ensures deadlock freedom by utilizing extended routing and router labeling rules. Furthermore, destination router partitioning and traffic-aware adaptive branching are incorporated to reduce packet routing hops and disperse channel traffic.
The effectiveness of MRCN was verified using Noxim, a well-known cycle-accurate NoC simulator, under various topologies and traffic patterns. The simulations revealed that, under saturated traffic conditions, MRCN improved the average latency by 13.98% and the throughput by 12.16% compared with previous multicast routing schemes for customized NoCs.
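For context on the labeling rule mentioned above, the following is a minimal sketch of the Hamiltonian-path labeling commonly used for path-based multicast in a regular 2D mesh, where such a path always exists. It is not MRCN's extended labeling rule for irregular topologies; the function names are hypothetical. Routers are numbered along a snake-like path, and destinations are split into those labeled higher and lower than the source.

```python
def snake_label(x, y, cols):
    """Label of router (x, y) along a snake-like Hamiltonian path.

    Even rows are numbered left to right, odd rows right to left.
    """
    return y * cols + (x if y % 2 == 0 else cols - 1 - x)

def is_hamiltonian_labeling(cols, rows):
    """Check that consecutive labels always sit on adjacent mesh routers."""
    by_label = {snake_label(x, y, cols): (x, y)
                for y in range(rows) for x in range(cols)}
    if sorted(by_label) != list(range(cols * rows)):
        return False
    for k in range(cols * rows - 1):
        (x0, y0), (x1, y1) = by_label[k], by_label[k + 1]
        if abs(x0 - x1) + abs(y0 - y1) != 1:
            return False
    return True

def partition_destinations(src_label, dest_labels):
    """Split destinations into a high and a low subset relative to the source.

    The high subset is visited in ascending label order and the low
    subset in descending order, so each copy moves monotonically.
    """
    high = sorted(l for l in dest_labels if l > src_label)
    low = sorted((l for l in dest_labels if l < src_label), reverse=True)
    return high, low

if __name__ == "__main__":
    print(is_hamiltonian_labeling(4, 4))
    print(partition_destinations(5, [1, 9, 3, 12]))
```

In an irregular custom topology the check in `is_hamiltonian_labeling` has no guaranteed counterpart, which is the gap the first observation in the abstract addresses.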
Processing-in-memory (PIM) places computational logic in the memory domain. It is the most promising solution to alleviate the memory bandwidth problem in deep neural network (DNN) processing. The hybrid memory cube (HMC), a 3D stacked memory structure, can efficiently implement the PIM architecture by maximizing the use of existing legacy hardware. To accelerate DNN inference, multiple HMCs can be connected, and data-independent tasks can be assigned to processing elements (PEs) within each HMC. However, owing to the packet-switched network structure, inter-HMC interconnects exhibit variable and unpredictable latencies depending on the data transmission path and link contention. A well-designed task schedule using context switching can effectively hide communication latency and improve PE utilization. Nevertheless, as the number of HMCs increases, the wide variability of inter-HMC communication latencies causes frequent context switching, degrading overall performance. This paper proposes a DNN task scheduling method that effectively utilizes task parallelism by reducing the communication latency variance caused by HMC interconnect characteristics. Task partitions are generated to exploit parallelism while keeping inter-HMC traffic within the sustainable link bandwidth. Task-to-HMC mapping is performed to hide the average communication latency of intermediate DNN processing results. A task schedule is generated using retiming to accelerate DNN inference while maximizing resource utilization. The effectiveness of the proposed method was verified through simulations using various realistic DNN applications performed on a ZSim x86-64 simulator. The simulations revealed that the proposed scheduling reduced the DNN processing time by 18.19% compared with conventional methods in which each HMC operates independently. INDEX TERMS Processing-in-memory, hybrid memory cube, deep neural network, task scheduling, parallel computing
As semiconductor processes enter the nanoscale, system-on-chip (SoC) interconnects suffer from link aging owing to negative bias temperature instability (NBTI), hot carrier injection (HCI), and electromigration. In network-on-chip (NoC) for heterogeneous manycore systems, links age at different speeds depending on the location and utilization of resources. In this paper, we propose a heterogeneous manycore NoC topology synthesis that predicts the aging effect of each link and deploys routers and error correction code (ECC) logic. Aging-aware ECC logic is added to each link to achieve the same link lifetime with less area and latency than the Bose-Chaudhuri-Hocquenghem (BCH) logic. Moreover, based on the modified genetic algorithm, we search for a solution that minimizes the average latency while ensuring the link lifetime by changing the number of routers, location, and network connectivity. Simulation results demonstrate that the aging-aware topology synthesis reduces the average latency of the network by up to 26.68% compared with the aging analysis and the addition of ECC logic on the link after the topology synthesis. Furthermore, topology synthesis with aging-aware ECC logic reduces the maximum average latency by up to 39.49% compared with added BCH logic.

... improve NoC performance in heterogeneous manycore NoC designs. Existing schemes have applied a fixed number of routers in the chip. These methods make it possible to find a reasonable solution in NoC with a specific number of routers. However, in situations where the number of routers is not assigned, algorithms must be executed several times for different numbers of routers, which requires additional computation time.

In contrast, due to the miniaturization of semiconductor processes, delay faults in communication data due to the aging of flip-flops and metal wires have become a significant concern in the SoC interconnect design [8][9][10].
In the nanoscale process, aging-induced delay faults occur mainly due to negative bias temperature instability (NBTI), hot carrier injection (HCI), and electromigration [11][12][13][14][15]. NBTI and HCI increase the threshold voltage of the transistors, and electromigration increases the resistance in metal wires, resulting in longer data transfer delay [8,9,15].

Few studies have considered the aging effect in the high-level design of the on-chip interconnect, because it was treated as a low-level design problem. However, recent studies show that in the topology synthesis of heterogeneous manycore NoC, the aging process can be predicted based on the length and communication load of each link [14][15][16]. If the aging effect of each link is taken into consideration during the NoC topology synthesis, it is possible to reduce the performance degradation through the placement of interconnect modules and links. Moreover, this will aid in the recovery of aging resilience by correcting the error even if a delay fault occurs by using the error correction logic.

The forward error correction, based on error cor...