The Celerity Open-Source 511-Core RISC-V Tiered Accelerator Fabric: Fast Architectures and Design Methodologies for Fast Chips

Davidson, Scott; Xie, Shaolin; Torng, Christopher; Al-Hawaj, Khalid; Rovinski, Austin; Ajayi, Tutu; Vega, Luis; Zhao, Chun; Zhao, Ritchie; Dai, Steve; Amarnath, Aporva; Veluri, Bandhav; Gao, Paul; Rao, Anuj; Liu, Gai; Gupta, Rajesh K.; Zhang, Zhiru; Dreslinski, Ronald G.; Batten, Christopher; Taylor, Michael

doi:10.1109/mm.2018.022071133

Cited by 63 publications

(22 citation statements)

References 4 publications

“…Many NoC designs employ link widths that accommodate a whole packet [16], [17], [18]. In such case, packets are single-flit and there is no difference between WH and VCT.…”

Section: Flow Control and Deadlock Avoidance (mentioning)

confidence: 99%

Efficient bypass in mesh and torus NoCs

Perez, Vallejo, Beivide (2020). Journal of Systems Architecture

Minimizing latency and power are key goals in the design of NoC routers. Different proposals combine lookahead routing and router bypass to skip the arbitration and buffering, reducing router delay. However, the conditions to use them requires completely empty buffers in the intermediate routers. This restricts the amount of flits that use the bypass pipeline especially at medium and high loads, increasing latency and power. This paper presents NEBB, Non-Empty Buffer Bypass, a mechanism that allows to bypass flits even if the buffers to bypass are not empty. The mechanism applies to wormhole and virtual-cut-through, each of them with different advantages. NEBB-Hybrid is proposed to employ the best flow control in each situation. The mechanism is extended to torus topologies, using FBFC and shared buffers. The proposals have been evaluated using Booksim, showing up to 75% reduction of the buffered flits for single-flit packets, which translates into latency and dynamic power reductions of up to 30% and 23% respectively. For bimodal traffic, these improvements are 20 and 21% respectively. Additionally, the bypass utilization is largely independent of the number of VCs when using shared buffers and very competitive with few private ones, allowing to simplify the allocation mechanisms.
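
The citation statement above notes that when the link is wide enough to carry a whole packet, packets are single-flit and wormhole (WH) and virtual cut-through (VCT) flow control behave the same. A minimal Python sketch of that reasoning follows. It assumes the textbook forwarding conditions (WH needs room for one flit downstream, VCT needs room for the whole packet). The function names and the 80-bit and 32-bit widths are hypothetical values chosen for illustration; they are not taken from the cited paper or from Celerity.

```python
def flits_per_packet(packet_bits: int, link_width_bits: int) -> int:
    """Number of flits needed to carry one packet over a link of the given width."""
    if packet_bits <= 0 or link_width_bits <= 0:
        raise ValueError("sizes must be positive")
    # Integer ceiling division: a partially filled last flit still counts.
    return -(-packet_bits // link_width_bits)


def can_forward(free_flit_slots: int, packet_flits: int, flow_control: str) -> bool:
    """Textbook condition for forwarding a packet's head flit to the next router."""
    if flow_control == "WH":
        # Wormhole: space for a single flit downstream is enough.
        return free_flit_slots >= 1
    if flow_control == "VCT":
        # Virtual cut-through: space for the entire packet is required.
        return free_flit_slots >= packet_flits
    raise ValueError(f"unknown flow control: {flow_control}")


# Hypothetical sizes: an 80-bit packet on an 80-bit link versus a 32-bit link.
wide = flits_per_packet(80, 80)    # whole packet fits in one flit
narrow = flits_per_packet(80, 32)  # packet is split across several flits

# With single-flit packets, WH and VCT agree for every buffer occupancy.
same_when_single_flit = all(
    can_forward(free, wide, "WH") == can_forward(free, wide, "VCT")
    for free in range(0, 8)
)
# With multi-flit packets, the two conditions can differ.
differ_when_multi_flit = any(
    can_forward(free, narrow, "WH") != can_forward(free, narrow, "VCT")
    for free in range(0, 8)
)
```

With one flit per packet the two conditions reduce to the same check, which is why the distinction disappears for full-packet-width links.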

“…An obvious solution is to design systems with multiple ASICs, but that leads to high under-utilization for applications with cascaded kernels. Moreover, fast-moving domains, such as ML, involve algorithms that evolve faster than the turnaround time to fabricate and test new ASICs, despite efforts on accelerating the design flow [20], thus subjecting them to near-term obsolescence [16,43]. Finally, ASICs are generally non-programmable, barring a few that use sophisticated software frameworks [1].…”

Section: Contemporary Computing Platforms (mentioning)

confidence: 99%

Transmuter

Pal, Feng, Park, et al. (2020). Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques

Self Cite

With the end of Dennard scaling and Moore's law, it is becoming increasingly difficult to build hardware for emerging applications that meet power and performance targets, while remaining flexible and programmable for end users. This is particularly true for domains that have frequently changing algorithms and applications involving mixed sparse/dense data structures, such as those in machine learning and graph analytics. To overcome this, we present a flexible accelerator called Transmuter, in a novel effort to bridge the gap between General-Purpose Processors (GPPs) and Application-Specific Integrated Circuits (ASICs). Transmuter adapts to changing kernel characteristics, such as data reuse and control divergence, through the ability to reconfigure the on-chip memory type, resource sharing and dataflow at run-time within a short latency. This is facilitated by a fabric of lightweight cores connected to a network of reconfigurable caches and crossbars. Transmuter addresses a rapidly growing set of algorithms exhibiting dynamic data movement patterns, irregularity, and sparsity, while delivering GPU-like efficiencies for traditional dense applications. Finally, in order to support programmability and ease-of-adoption, we prototype a software stack composed of low-level runtime routines, and a high-level language library called TransPy, that cater to expert programmers and end-users, respectively. Our evaluations with Transmuter demonstrate average throughput (energy-efficiency) improvements of 5.0× (18.4×) and 4.2× (4.0×) over a high-end CPU and GPU, respectively, across a diverse set of kernels predominant in graph analytics, scientific computing and machine learning. Transmuter achieves energy-efficiency gains averaging 3.4× and 2.0× over prior FPGA and CGRA implementations of the same kernels, while remaining on average within 9.3× of state-of-the-art ASICs. 

“…RISC-V Cores. Moreover, we have Linux-capable implementations of RISC-V processors like BlackParrot [22], ETH Zurich Ariane [29], and Berkeley Rocket [11], as well as GP-GPU-style compute throughput fabrics like HammerBlade Manycore [8] (descended from Celerity [9,14,23]), microcontrollers like Western Digital's SweRV [3], and scalable multicore server processors like the RISC-V incarnation of Princeton OpenPiton [12]. RISC-V unlocking research and education.…”

Section: RISC-V (mentioning)

confidence: 99%