Abstract: Escalating system-on-chip design complexity is pushing the design community to raise the level of abstraction beyond the register transfer level. Despite the unsuccessful adoption of early generations of commercial high-level synthesis (HLS) systems, we believe that the tipping point for transitioning to HLS methodology is happening now, especially for field-programmable gate array (FPGA) designs. The latest generation of HLS tools has made significant progress in providing wide language coverage and robust compilation technology, platform-based modeling, advancement in core HLS algorithms, and a domain-specific approach. In this paper, we use AutoESL's AutoPilot HLS tool coupled with domain-specific system-level implementation platforms developed by Xilinx as an example to demonstrate the effectiveness of state-of-the-art C-to-FPGA synthesis solutions targeting multiple application domains. Complex industrial designs targeting Xilinx FPGAs are also presented as case studies, including a comparison of HLS solutions versus optimized manual designs. In particular, the experiment on a sphere decoder shows that the HLS solution can achieve an 11-31% reduction in FPGA resource usage with improved design productivity compared to the hand-coded design.
Index Terms: Domain-specific design, field-programmable gate array (FPGA), high-level synthesis (HLS), quality of results (QoR).
In this paper we present an approach for quantitative analysis of application-specific dataflow architectures. The approach allows the designer to rate design alternatives in a quantitative
1: Introduction
In the application domain of real-time video, the required processing power is on the order of hundreds of RISC-like operations per pixel, while the data rate of pixel streams is in the range of 10 to 100 Msamples per second. Consequently, architectures are needed that perform billions of operations per second and have an internal communication bandwidth of Gbytes per second.
In the application domain of real-time video we focus on dedicated architectures that support the concept of streams [17] and achieve the required performance by exploiting the inherent parallelism of the applications on domain-specific, coarse-grain processors with limited internal flexibility (i.e., weakly programmable). An example of such a domain-specific architecture is given in figure 1. The architecture consists of different dedicated application-specific coarse-grain processors that operate independently of each other on data streams. These streams are exchanged between the coarse-grain processors via a communication network and are controlled by a global controller. These kinds of architectures are typically embedded in a larger system that also contains memory and a general-purpose processor, e.g., a RISC processor.
In the design of these architectures, many choices have to be made. In this paper we present a simulation environment that aids the designer in making these choices based on quantitative information. In section 2 we present our problem statement. A solution approach is given in section 3. In section 4 we review related work on quantitative evaluation of design alternatives. The solution approach is further detailed for application-specific dataflow architectures in the following sections. In
We present a methodology for the exploration of signal processing architectures at the system level. The methodology, named SPADE, provides a means to quickly build models of architectures at an abstract level, to easily map applications, modeled as Kahn Process Networks, onto these architecture models, and to analyze the performance of the resulting system by simulation. The methodology distinguishes between applications and architectures, and uses a trace-driven simulation technique for co-simulation of application models and architecture models. As a consequence, architecture models need not be functionally complete to be used for performance analysis, while data-dependent behavior is still handled correctly. We have used the methodology for the exploration of architectures and mappings of an MPEG-2 decoder application.
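The separation between application and architecture described above can be illustrated with a minimal sketch of trace-driven performance estimation. This is not SPADE's actual interface: the function names, the trace format, and the cycle costs in `COSTS` are all hypothetical, and the two processes are run one after the other, which sidesteps the blocking-read scheduling a real Kahn Process Network needs. The point is only that the application computes real data and logs what it did, while the architecture model assigns a cost to each logged action without computing anything.

```python
from collections import deque

def producer(n, out_fifo, trace):
    """Application process: generate n tokens, logging each action it takes."""
    for i in range(n):
        trace.append(("execute", "generate"))
        out_fifo.append(i)
        trace.append(("write", "ch0"))

def consumer(in_fifo, trace):
    """Application process: sum all tokens, logging each action it takes."""
    total = 0
    while in_fifo:
        total += in_fifo.popleft()
        trace.append(("read", "ch0"))
        trace.append(("execute", "accumulate"))
    return total

# Hypothetical cycle costs an architecture model might assign to trace entries.
COSTS = {
    ("execute", "generate"): 5,
    ("execute", "accumulate"): 3,
    ("write", "ch0"): 2,
    ("read", "ch0"): 2,
}

def simulate_trace(trace, costs):
    """Architecture model: no functional behavior, only a cost per trace entry."""
    return sum(costs[entry] for entry in trace)

channel = deque()
producer_trace, consumer_trace = [], []
producer(4, channel, producer_trace)
result = consumer(channel, consumer_trace)
producer_cycles = simulate_trace(producer_trace, COSTS)
consumer_cycles = simulate_trace(consumer_trace, COSTS)
```

Because the traces come from a functionally correct run, any data-dependent control flow in the application shows up in the trace automatically, and swapping in a different `COSTS` table evaluates a different architecture without touching the application code.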
Research has shown that convolutional neural networks contain significant redundancy, and high classification accuracy can be obtained even when weights and activations are reduced from floating point to binary values. In this paper, we present Finn, a framework for building fast and flexible FPGA accelerators using a flexible heterogeneous streaming architecture. By utilizing a novel set of optimizations that enable efficient mapping of binarized neural networks to hardware, we implement fully connected, convolutional and pooling layers, with per-layer compute resources being tailored to user-provided throughput requirements. On a ZC706 embedded FPGA platform drawing less than 25 W total system power, we demonstrate up to 12.3 million image classifications per second with 0.31 µs latency on the MNIST dataset with 95.8% accuracy, and 21906 image classifications per second with 283 µs latency on the CIFAR-10 and SVHN datasets with 80.1% and 94.9% accuracy, respectively. To the best of our knowledge, ours are the fastest classification rates reported to date on these benchmarks.
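The abstract does not spell out its optimizations, so the following is only a sketch of one widely used idea for binarized networks, not a description of the framework itself. If weights and activations are restricted to -1 and +1 and stored as single bits (here 1 for +1 and 0 for -1, an assumed encoding), a dot product reduces to an XNOR followed by a population count. The helper names `pack_bits` and `binary_dot` are hypothetical.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an int: bit i is 1 for +1, 0 for -1."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word

def binary_dot(a_word, w_word, n):
    """Dot product of two n-element {-1, +1} vectors given in packed form."""
    mask = (1 << n) - 1
    # XNOR marks positions where the two vectors agree; popcount tallies them.
    matches = bin(~(a_word ^ w_word) & mask).count("1")
    # Each agreement contributes +1 and each disagreement -1.
    return 2 * matches - n

activations = [1, -1, 1, 1]
weights = [1, 1, -1, 1]
packed_result = binary_dot(pack_bits(activations), pack_bits(weights), 4)
reference = sum(a * w for a, w in zip(activations, weights))
```

The packed form replaces multipliers and adders with bitwise logic and a counter, which is why binarization is attractive for FPGA fabric.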