2018 28th International Conference on Field Programmable Logic and Applications (FPL)
DOI: 10.1109/fpl.2018.00014
Embracing Diversity: Enhanced DSP Blocks for Low-Precision Deep Learning on FPGAs

Cited by 57 publications (32 citation statements)
References 12 publications
“…With the proposed column-wise MVM, one column of the weight matrix naturally shares the same element of the input vector, which helps us pack four 8-bit or ten 2-bit multiplications into one DSP block on Intel FPGAs [47] to reduce hardware resources. Moreover, this would not be a restriction (and would come at lower cost) with a novel DSP similar to what was proposed in [48], which will be adopted in the next-generation Agilex devices [49].…”
Section: Low-Precision Multiplications with DSP Block Sharing
Confidence: 99%
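The shared-operand packing this excerpt relies on can be illustrated with a small software model. Below is a minimal sketch, not taken from the cited works: it packs two unsigned 8-bit weights that share one activation into a single wide multiplication, the same trick a DSP block's wide multiplier enables. Sign handling and the four-way/ten-way packings are omitted, and `packed_mul` and `SHIFT` are illustrative names.

```python
# Minimal sketch: two unsigned 8-bit products sharing one operand,
# recovered from a single wide multiplication (the DSP packing trick).

SHIFT = 18  # field offset; each 16-bit product cannot spill into the next field

def packed_mul(w0: int, w1: int, x: int) -> tuple[int, int]:
    """Return (w0*x, w1*x) computed with one multiplication."""
    assert 0 <= w0 < 256 and 0 <= w1 < 256 and 0 <= x < 256
    packed = (w1 << SHIFT) | w0      # both weights placed in one wide word
    product = packed * x             # the single wide multiplication
    return product & 0xFFFF, (product >> SHIFT) & 0xFFFF

# Sanity check against the two independent multiplications.
for w0, w1, x in [(3, 200, 7), (255, 255, 255), (0, 1, 128)]:
    assert packed_mul(w0, w1, x) == (w0 * x, w1 * x)
```

The field offset must exceed the product width so the low product never carries into the high field; signed operands would additionally require a correction term on the upper field.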
“…When using lower precisions on FPGAs, many authors have implemented multipliers in LUTs instead of DSPs to achieve higher resource efficiency. Boutros et al. [15] proposed enhancing DSP blocks to support low-precision MACs with roughly 12% area overhead and no drop in achievable frequency. One such enhanced DSP can perform one 27 × 27, two 18 × 19, four 9 × 9, or eight 4 × 4 parallel MACs.…”
Section: Fixed-Point Representation
Confidence: 99%
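The packing ratios quoted in this excerpt translate directly into resource estimates. A back-of-the-envelope helper, assuming only the per-block MAC counts stated above (the names are illustrative, not from the paper):

```python
# MACs per enhanced DSP block at each supported precision, as quoted above.
MACS_PER_BLOCK = {27: 1, 18: 2, 9: 4, 4: 8}

def dsp_blocks_needed(macs_per_cycle: int, precision_bits: int) -> int:
    """Ceiling division: blocks required to sustain a given MAC throughput."""
    per_block = MACS_PER_BLOCK[precision_bits]
    return -(-macs_per_cycle // per_block)

# A layer needing 512 MACs/cycle: 512 blocks at 27-bit, but only 64 at 4-bit.
assert dsp_blocks_needed(512, 27) == 512
assert dsp_blocks_needed(512, 4) == 64
```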
“…3) Many overflow cases after adding the error-reduction term. SIMD Accurate/Approximate Multiplier: the authors of [6, 19] have shown performance and energy improvements in FPGA-based DNNs by modifying the ASIC-based DSP block to perform two approximate multiplications with a common operand. Recently, [23] proposed an approximate SIMD design (using 8 × 8 truncated multipliers) for ASIC platforms.…”
Section: Related Work
Confidence: 99%
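For intuition on the truncated multipliers mentioned here, a software model is sketched below. It is a hedged illustration, not the design from [23]: it sums only the partial-product bits at column weight k or above, the standard truncation scheme that trades a bounded error for area savings; `truncated_mul` is a hypothetical name.

```python
def truncated_mul(a: int, b: int, k: int = 4, width: int = 8) -> int:
    """Approximate a*b by discarding partial-product columns below weight k."""
    assert 0 <= a < (1 << width) and 0 <= b < (1 << width)
    total = 0
    for i in range(width):                 # bit i of a
        for j in range(width):             # bit j of b
            if i + j >= k and (a >> i) & 1 and (b >> j) & 1:
                total += 1 << (i + j)      # keep only high-weight columns
    return total

# Worst-case error is the total weight of the dropped bits: (k-1)*2**k + 1.
print(truncated_mul(200, 55), 200 * 55)    # 10992 vs. the exact 11000
```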
“…Nevertheless, despite their advantages, off-the-shelf fixed-precision DSP blocks fall short of fulfilling design requirements in a variety of domains. Besides being unable to perform division, several shortcomings testify to their inefficiency: 1) their fixed locations in FPGAs impose routing complexity and often degrade the performance of some circuits [17] (and of the Viterbi decoder and the Reed-Solomon and JPEG encoders discussed in [30]); 2) they cannot be efficiently utilized for multiplication precisions below 18 bits [6, 19] (the comparable performance and better energy efficiency of small-scale LUT-based multipliers over DSP blocks further encourage their deployment in, e.g., neural networks); 3) their ratio to LUTs is limited (< 0.001), which becomes a bottleneck in multiplication-intensive applications or concurrently executing programs.…”
Section: Introduction
Confidence: 99%