From high-level deep neural models to FPGAs

Sharma, Hardik; Park, Jongse; Mahajan, Divya; Amaro, Emmanuel; Kim, Joon-Kyung; Shao, Chenkai; Mishra, Asit K.; Esmaeilzadeh, Hadi

doi:10.1109/micro.2016.7783720

Cited by 392 publications

(224 citation statements)

References 25 publications


“…These techniques can be integrated into Multi-CLP designs. [24] and [30] propose complete frameworks for generating FPGA-based accelerators from CNN specifications. Our Multi-CLP approach can be integrated into these frameworks to improve the performance of auto-generated accelerators.…”

Section: Related Work (mentioning)

confidence: 99%

Maximizing CNN Accelerator Efficiency Through Resource Partitioning

Shen, Yongming; Ferdman, Michael; Milder, Peter

2017

SIGARCH Comput. Archit. News

133

149


Convolutional neural networks (CNNs) are revolutionizing machine learning, but they present significant computational challenges. Recently, many FPGA-based accelerators have been proposed to improve the performance and efficiency of CNNs. Current approaches construct a single processor that computes the CNN layers one at a time; the processor is optimized to maximize the throughput at which the collection of layers is computed. However, this approach leads to inefficient designs because the same processor structure is used to compute CNN layers of radically varying dimensions. We present a new CNN accelerator paradigm and an accompanying automated design methodology that partitions the available FPGA resources into multiple processors, each of which is tailored for a different subset of the CNN convolutional layers. Using the same FPGA resources as a single large processor, multiple smaller specialized processors increase computational efficiency and lead to a higher overall throughput. Our design methodology achieves 3.8x higher throughput than the state-of-the-art approach on evaluating the popular AlexNet CNN on a Xilinx Virtex-7 FPGA. For the more recent SqueezeNet and GoogLeNet, the speedups are 2.2x and 2.0x.


“…However, recent research on DNNs is still increasing the depth of models and introducing new architectures, resulting in higher number of parameters per network and higher computational complexity. Other than CPUs and GPUs, FPGAs are becoming a platform candidate to achieve energy efficient neural network computation [12], [13], [22], [24]-[27]. Equipped with the necessary hardware for basic DNN operations, FPGAs are able to achieve high parallelism and utilize the properties of neural network computation to remove unnecessary logic.…”

Section: Prior Work on Accelerating DNNs for FPGAs (mentioning)

confidence: 99%

“…Prior works have shown FPGAs to be successful in accelerating the inference of pre-trained neural networks by providing custom data paths to achieve high parallelism. A vast amount of such research focuses on accelerating neural networks in the image domain [12], [13], speech recognition [14], [15] and language modelling [16]. To the best of our knowledge, similar efforts have not been made for accelerating neural networks for speech/audio synthesis.…”

Section: Introduction (mentioning)

confidence: 99%

FastWave: Accelerating Autoregressive Convolutional Neural Networks on FPGA

Hussain; Javaheripi; Neekhara; et al.

2019

2019 IEEE/ACM International Conference on Computer-Aided Design (ICCAD)


Autoregressive convolutional neural networks (CNNs) have been widely exploited for sequence generation tasks such as audio synthesis, language modeling and neural machine translation. WaveNet is a deep autoregressive CNN composed of several stacked layers of dilated convolution that is used for sequence generation. While WaveNet produces state-of-the-art audio generation results, the naive inference implementation is quite slow; it takes a few minutes to generate just one second of audio on a high-end GPU. In this work, we develop the first accelerator platform, FastWave, for autoregressive convolutional neural networks, and address the associated design challenges. We design the Fast-Wavenet inference model in Vivado HLS and perform a wide range of optimizations including fixed-point implementation, array partitioning and pipelining. Our model uses a fully parameterized parallel architecture for fast matrix-vector multiplication that enables per-layer customized latency fine-tuning for further throughput improvement. Our experiments comparatively assess the tradeoff between throughput and resource utilization for various optimizations. Our best WaveNet design on the Xilinx XCVU13P FPGA, which uses only on-chip memory, achieves 66× faster generation speed compared to CPU implementation and 11× faster generation speed than GPU implementation.


“…In [17], Chen et al. used batch processing to maximise weights reuse in ConvNet layers across multiple inputs. [18] and [19] are more similar to our approach in presenting automated flows for mapping ConvNets to FPGAs. Both frameworks optimise for throughput and employ favourable batch sizes, with [19] also aiming to keep the batch size small.…”

Section: Performance Comparison (mentioning)

confidence: 99%

Latency-driven design for FPGA-based convolutional neural networks

Venieris; Bouganis

2017

2017 27th International Conference on Field Programmable Logic and Applications (FPL)


In recent years, Convolutional Neural Networks (ConvNets) have become the quintessential component of several state-of-the-art Artificial Intelligence tasks. Across the spectrum of applications, the performance needs vary significantly, from high-throughput image recognition to the very low-latency requirements of autonomous cars. In this context, FPGAs can provide a potential platform that can be optimally configured based on different performance requirements. However, with the increasing complexity of ConvNet models, the architectural design space becomes overwhelmingly large, asking for principled design flows that address the application-level needs. This paper presents a latency-driven design methodology for mapping ConvNets on FPGAs. The proposed design flow employs novel transformations over a Synchronous Dataflow-based modelling framework together with a latency-centric optimisation procedure in order to efficiently explore the design space targeting low-latency designs. Quantitative evaluation shows large improvements in latency when latency-driven optimisation is in place, yielding designs that improve the latency of AlexNet by 73.54× and VGG16 by 5.61× over throughput-optimised designs.
