Abstract: Incorporating Networks-on-Chip (NoC) within FPGAs has the potential not only to improve the efficiency of the interconnect, but also to increase designer productivity and reduce compile time by raising the abstraction level of communication. By comparing NoC components on FPGAs and ASICs we quantify the efficiency gap between the two platforms and use the results to understand the design tradeoffs in that space. The crossbar has the largest FPGA vs. ASIC gaps: 85× area and 4.4× delay, while the input buffers have the smallest: 17× area and 2.9× delay. For a soft NoC router, these results indicate that wide datapaths, deep buffers and a small number of ports and virtual channels (VC) are favorable for FPGA implementation. If one hardens a complete state-of-the-art VC router it is on average 30× more area efficient and can achieve 3.6× the maximum frequency of a soft implementation. We show that this hard router can be integrated with the soft FPGA interconnect, and still achieve an area improvement of 22×. A 64-node NoC of hard routers with soft interconnect utilizes area equivalent to 1.6% of the logic modules in the latest FPGAs, compared to 33% for a soft NoC.
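The system-level figures above are consistent with the quoted per-router area gap. A minimal Python sketch of the arithmetic, using only the percentages from the abstract (the absolute logic-module count of a device is not needed for the ratio):

```python
# Sanity check on the reported NoC area figures.
# Only the two percentages below come from the abstract; everything
# else is bookkeeping.

SOFT_NOC_FRACTION = 0.33        # soft 64-node NoC: 33% of logic modules
HARD_SOFT_NOC_FRACTION = 0.016  # hard routers + soft interconnect: 1.6%

# Implied system-level area gap between the soft NoC and the
# hard-router/soft-interconnect NoC:
implied_gap = SOFT_NOC_FRACTION / HARD_SOFT_NOC_FRACTION
print(f"Implied area gap: {implied_gap:.1f}x")  # ~20.6x, in line with the 22x per-router figure
```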
Overlays have shown significant promise for field-programmable gate arrays (FPGAs) as they allow for fast development cycles and remove many of the challenges of the traditional FPGA hardware design flow. However, this often comes with a significant performance burden, resulting in very little adoption of overlays for practical applications. In this paper, we tailor an overlay to a specific application domain, and we show how we maintain its full programmability without paying the performance overhead traditionally associated with overlays. Specifically, we introduce an overlay targeted for deep neural network inference with only ~1% overhead to support the control and reprogramming logic using a lightweight very-long instruction word (VLIW) network. Additionally, we implement a sophisticated domain-specific graph compiler that compiles deep learning languages such as Caffe or TensorFlow to easily target our overlay. We show how our graph compiler performs architecture-driven software optimizations to significantly boost performance of both convolutional and recurrent neural networks (CNNs/RNNs): we demonstrate a 3× improvement on ResNet-101 and a 12× improvement for long short-term memory (LSTM) cells, compared to naïve implementations. Finally, we describe how we can tailor our hardware overlay and use our graph compiler to achieve ~900 fps on GoogLeNet on an Intel Arria 10 1150, the fastest ever reported on comparable FPGAs.
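The abstract does not spell out the compiler's individual passes. As one illustrative example of the architecture-driven graph optimizations it describes, the sketch below folds batch normalization into a preceding convolution at compile time, a standard inference-time graph rewrite; the graph representation and all names here are hypothetical, not taken from the paper:

```python
# Illustrative graph-compiler pass: fold a batch-normalization node into
# the preceding convolution, so conv -> bn becomes a single conv with
# adjusted weights and bias. This is a standard inference-time rewrite of
# the architecture-driven kind the abstract describes; the node types and
# names are hypothetical.
from dataclasses import dataclass
import numpy as np

@dataclass
class ConvNode:
    weights: np.ndarray  # shape: (out_ch, in_ch, kh, kw)
    bias: np.ndarray     # shape: (out_ch,)

@dataclass
class BatchNormNode:
    gamma: np.ndarray    # per-channel scale, shape: (out_ch,)
    beta: np.ndarray     # per-channel shift
    mean: np.ndarray     # running mean
    var: np.ndarray      # running variance
    eps: float = 1e-5

def fold_batchnorm(conv: ConvNode, bn: BatchNormNode) -> ConvNode:
    """Rewrite BN(conv(x)) as a single equivalent convolution."""
    scale = bn.gamma / np.sqrt(bn.var + bn.eps)        # per-output-channel scale
    new_w = conv.weights * scale[:, None, None, None]  # scale each output filter
    new_b = (conv.bias - bn.mean) * scale + bn.beta    # absorb shift into the bias
    return ConvNode(weights=new_w, bias=new_b)
```

Rewrites of this kind eliminate entire nodes (and their passes over feature maps) before the graph ever reaches the hardware, which is how a graph compiler can deliver large speedups over naïve, node-by-node implementations.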