2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
DOI: 10.1109/ipdpsw.2019.00028

Pareto Optimal Design Space Exploration for Accelerated CNN on FPGA

Abstract: Convolutional Neural Networks (CNNs) are at the base of many applications, both in embedded and in server-class contexts. While Graphics Processing Units (GPUs) are predominantly used for training, solutions for inference often rely on Field Programmable Gate Arrays (FPGAs), since they are more flexible and cost-efficient in many scenarios. However, existing approaches fall short of accomplishing several conflicting goals, like efficiently using resources on multiple platforms while retaining deep configurability a…
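A minimal sketch of what the Pareto-optimal selection step of such a design space exploration amounts to (not the paper's implementation; the configuration fields and numbers below are hypothetical): each candidate accelerator configuration is scored on two conflicting objectives, estimated latency and DSP usage, and only the non-dominated points are kept.

```python
# Illustrative sketch (not from the paper): extract the Pareto front from a set
# of candidate accelerator configurations scored by estimated latency (lower is
# better) and DSP usage (lower is better). All fields and values are hypothetical.

from dataclasses import dataclass

@dataclass(frozen=True)
class DesignPoint:
    unroll_x: int        # hypothetical spatial unroll factor
    unroll_c: int        # hypothetical channel unroll factor
    latency_ms: float    # estimated latency for one inference
    dsp_used: int        # estimated DSP blocks consumed

def dominates(a: DesignPoint, b: DesignPoint) -> bool:
    """a dominates b if it is no worse in both objectives and better in at least one."""
    no_worse = a.latency_ms <= b.latency_ms and a.dsp_used <= b.dsp_used
    better = a.latency_ms < b.latency_ms or a.dsp_used < b.dsp_used
    return no_worse and better

def pareto_front(points: list[DesignPoint]) -> list[DesignPoint]:
    """Keep only the non-dominated design points."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

if __name__ == "__main__":
    candidates = [
        DesignPoint(1, 4, 9.0, 128),
        DesignPoint(2, 4, 5.1, 256),
        DesignPoint(2, 8, 3.2, 512),
        DesignPoint(4, 4, 5.0, 512),   # dominated by (2, 8): same DSPs, slower
    ]
    for p in pareto_front(candidates):
        print(p)
```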

Cited by 13 publications (8 citation statements). References 16 publications.
“…Similarly, dataflows exploiting the local computation of feature maps between CNN layers have already been employed in a variety of hardware accelerators. For example, dataflows that fuse the computation between subsets of subsequent layers within a CNN (Alwani et al., 2016) have been used in implementations of CNNs on FPGA-based accelerators (Reggiani et al., 2019). Moreover, such advanced dataflows have been proposed for heterogeneous architectures in order to improve throughput and avoid use of off-chip memory (Wei et al., 2018).…”
Section: Discussion (mentioning)
confidence: 99%
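A minimal sketch of the fused-layer idea referenced above (an illustration, not code from Alwani et al. or Reggiani et al.; shapes, tile size, and names are assumed): an output tile of the second convolutional layer is computed directly from the input region it depends on, so the first layer's intermediate feature map stays in a small local buffer instead of going to off-chip memory.

```python
# Sketch of a fused two-layer dataflow on a single output tile. The layer-1
# intermediate lives in a (tile+2) x (tile+2) local buffer only. All sizes are
# illustrative assumptions.

import numpy as np

def conv3x3(x, w):
    """Valid 3x3 convolution, single channel, stride 1 (reference helper)."""
    H, W = x.shape
    out = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            out[i, j] = np.sum(x[i:i+3, j:j+3] * w)
    return out

def fused_two_layer_tile(x, w1, w2, ti, tj, tile=4):
    """Compute a (tile x tile) output tile of layer 2 at offset (ti, tj)."""
    # Receptive field of the layer-2 tile in the original input:
    # tile + 2 (layer-2 halo) + 2 (layer-1 halo) in each dimension.
    patch = x[ti:ti + tile + 4, tj:tj + tile + 4]
    local = conv3x3(patch, w1)          # (tile+2) x (tile+2), kept "on chip"
    return conv3x3(local, w2)           # tile x tile output of layer 2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((16, 16))
    w1, w2 = rng.standard_normal((3, 3)), rng.standard_normal((3, 3))
    reference = conv3x3(conv3x3(x, w1), w2)           # layer-by-layer baseline
    tile = fused_two_layer_tile(x, w1, w2, ti=4, tj=4)
    assert np.allclose(tile, reference[4:8, 4:8])      # fused tile matches baseline
```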
“…Resource and performance models are proposed by Reggiani et al [169] for convolutional neural network (CNN) accelerators, to drive an automatic Pareto-optimal DSE, exploring network performance on different hardware platforms. These models are applied to convolutional cores, which are critical components of the design, directly affecting the overall latency and DSP utilization.…”
Section: A. Models (mentioning)
confidence: 99%
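A first-order analytical cost model of the kind such a DSE relies on might look as follows (a hedged sketch, not the model from [169]; the unroll knobs, one-DSP-per-MAC assumption, and clock frequency are illustrative): cycle count and DSP usage of a convolutional core are expressed in closed form over the design parameters, so many configurations can be scored without synthesizing each one.

```python
# Illustrative first-order model of one convolutional layer's latency and DSP
# usage as a function of two assumed unroll factors. Not the published model.

from math import ceil

def conv_layer_cost(H, W, C_in, C_out, K, p_out, p_in, freq_mhz=200.0):
    """Estimated cycles, DSPs, and latency for one conv layer.

    H, W, C_in, C_out, K : layer geometry (output height/width, channels, kernel size)
    p_out, p_in          : unroll factors over output and input channels (assumed knobs)
    """
    macs_per_cycle = p_out * p_in                     # parallel multiply-accumulates
    total_macs = H * W * C_out * C_in * K * K
    cycles = H * W * K * K * ceil(C_out / p_out) * ceil(C_in / p_in)
    dsp = macs_per_cycle                              # assume one DSP per MAC
    latency_ms = cycles / (freq_mhz * 1e3)
    return {"cycles": cycles, "dsp": dsp, "latency_ms": latency_ms,
            "utilization": total_macs / (cycles * macs_per_cycle)}

if __name__ == "__main__":
    # Sweep the two knobs and print the estimates a DSE pass would compare.
    for p_out in (4, 8, 16):
        for p_in in (4, 8):
            print(p_out, p_in, conv_layer_cost(56, 56, 64, 64, 3, p_out, p_in))
```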
“…However, their model is based on the assumption that the performance/area changes monotonically when an individual design parameter is modified, which is not a valid assumption, as we explained in Challenge 2 of Section 1. To increase the accuracy of the estimation model, a number of other studies restrict the target application to those that have a well-defined accelerator micro-architecture template [6,11,12,28,32,42], a specific application [39,45], or a particular computation pattern [7,19,25]; hence, they lose generality.…”
Section: Model-Based Techniques (mentioning)
confidence: 99%
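A small numeric example of that non-monotonicity (an illustration under assumed parameters, not taken from the cited studies): sweeping a single channel-unroll factor makes a crude area-delay product fluctuate because of ceiling effects, so a search that assumes monotonic behavior in one knob can step over the better configuration.

```python
# Why monotonicity in a single knob is not a safe assumption: with 96 channels,
# the cycles-x-DSP product (a rough area-delay proxy) does not change
# monotonically as one unroll factor grows, because ceil() padding wastes lanes
# whenever the factor does not divide the channel count. Numbers are hypothetical.

from math import ceil

CHANNELS = 96  # hypothetical channel count of one layer

print(f"{'unroll':>6} {'cycles':>7} {'dsp':>4} {'cycles*dsp':>11}")
for unroll in range(8, 41, 4):
    cycles = ceil(CHANNELS / unroll)   # passes over the channel dimension
    dsp = unroll                       # one DSP per parallel lane (assumed)
    print(f"{unroll:>6} {cycles:>7} {dsp:>4} {cycles * dsp:>11}")
# unroll=24 -> 4 cycles, product  96 (divides evenly)
# unroll=28 -> 4 cycles, product 112 (padding wastes lanes)
# unroll=32 -> 3 cycles, product  96 (better again): non-monotone in the knob
```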