Array Multipliers for High Throughput in Xilinx FPGAs with 6-Input LUTs

Walters, E. George

doi:10.3390/computers5040020

Cited by 43 publications

(14 citation statements)

References 29 publications


“…In this case, the generate and propagate signals are functions of the five variables (w_0, w_1, x_i, a_i, 2a_i), and thus can be implemented with the same number of LUTs as with the signed 2-bit independent weights described before. However, in the case of multiplication by the unsigned 2-bit weights {0, 1, 2, 3}, the generate and propagate signals are functions of the six variables w_0, w_1, x_i, a_i, 2a_i, 3a_i, since the 3× multiple is also needed, and cannot be implemented with a single LUT. However, using a modified Booth recoding algorithm [38] and the implementation method proposed in [39], it is possible to avoid the 3× multiple and implement the addition of a variable using a 5-variable function with a single level of LUTs.…”

Section: A Hybrid Core For 8-bit Activations and 8/2-bit Weights - C8:82 (mentioning)

confidence: 99%
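
The statement above notes that Booth recoding removes the need for a 3× multiple. As a rough illustration only, the sketch below uses the textbook radix-4 modified Booth recoding, which is not necessarily the exact scheme of [38] or [39] in the citing paper. The function names `booth_radix4_digits` and `booth_multiply` are hypothetical. Each recoded digit lies in {-2, -1, 0, 1, 2}, so only the multiples a and 2a (and their negations) are ever selected; an unsigned weight of 3 becomes 4 - 1.

```python
def booth_radix4_digits(w: int, bits: int) -> list[int]:
    """Recode a signed `bits`-wide integer into radix-4 Booth digits.

    Digits are returned least significant first and lie in {-2, ..., 2}.
    An unsigned weight can be passed by giving it spare zero bits on top,
    e.g. a 2-bit unsigned weight is recoded with bits=4.
    """
    if bits % 2:
        bits += 1  # sign-extend to an even width
    u = w & ((1 << bits) - 1)  # two's-complement bit pattern
    digits = []
    prev = 0  # bit to the right of the current pair (0 for the first pair)
    for i in range(0, bits, 2):
        b0 = (u >> i) & 1
        b1 = (u >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth_multiply(w: int, bits: int, a: int) -> int:
    """Multiply a by w using only the multiples 0, a and 2a of a."""
    acc = 0
    for i, d in enumerate(booth_radix4_digits(w, bits)):
        partial = (0, a, a << 1)[abs(d)]  # no 3a entry is needed
        if d < 0:
            partial = -partial
        acc += partial << (2 * i)
    return acc
```

Here `booth_radix4_digits(3, 4)` yields the digits -1 and +1, so the product 3a is formed as 4a - a.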

A Configurable Architecture for Running Hybrid Convolutional Neural Networks in Low-Density FPGAs

et al. 2020


Convolutional neural networks have become the state of the art of machine learning for a vast set of applications, especially for image classification and object detection. There are several advantages to running inference on these models at the edge, including real-time performance and data privacy. The high computing and memory requirements of convolutional neural networks have been major obstacles to the broader deployment of CNNs on edge devices. Data quantization is an optimization method that reduces the number of bits used to represent weights and activations of a network model, minimizing storage requirements and computing complexity. Quantization can be applied at the layer level, by using different bit widths in different layers: this is called hybrid quantization. This article proposes a new efficient and configurable architecture for running CNNs with hybrid quantization in low-density Field-Programmable Gate Arrays (FPGAs) targeting edge devices. The architecture has been implemented on the Xilinx ZYNQ7020/45 devices and is running the AlexNet and VGG16 networks. Running AlexNet, the architecture has a throughput up to 508 images per second on the ZYNQ7020 device, and 1639 images per second on the ZYNQ7045 device. Considering VGG16, the architecture delivers up to 43 images per second on the ZYNQ7020 device, and 81 images per second on the ZYNQ7045 device. The proposed hybrid architecture achieves up to 13.7× improvement in performance compared to state-of-the-art solutions, with small accuracy degradation.


“…In [12], a multiplexer-based 8-bit multiplier is presented with 50 MHz frequency, whereas the proposed architecture achieves 320 MHz frequency for 16-bit multiplication. E. George Walters III presents array multipliers using six-input LUTs and shift register LUTs [13], whereas the research presented in this article presents those using four-input LUTs. Modern FPGAs have built-in multipliers, but configurable multipliers using LUTs still play a vital role in many applications due to their flexible size, placement and modification ability [13].…”

Section: Introduction (mentioning)

confidence: 99%

“…E. George Walters III presents array multipliers using six-input LUTs and shift register LUTs [13], whereas the research presented in this article presents those using four-input LUTs. Modern FPGAs have built-in multipliers, but configurable multipliers using LUTs still play a vital role in many applications due to their flexible size, placement and modification ability [13]. Many researchers have worked on the design of multipliers earlier, as reported in this section, but they have not explored the option of reusing the same resources using iterative methods.…”

Section: Introduction (mentioning)

confidence: 99%

An area-optimized N-bit multiplication technique using N/2-bit multiplication algorithm

et al. 2019


A unique design for an optimized N-bit multiplier is proposed and implemented which utilizes a modified divide-and-conquer technique. The conventional technique requires four N/2-bit multipliers to perform N-bit multiplication, whereas the proposed design uses only one multiplier module in hardware to perform the functionality of four modules. It uses the Dadda algorithm in its multiplier module. It has been implemented using Verilog HDL, and simulation results verified its functionality. The design was also synthesized on various FPGAs including Spartan 3E, Virtex-5 and Virtex-7. The performance summary, after place and route, showed that the proposed approach significantly reduces hardware utilization. Furthermore, the proposed design is almost 75% more efficient in terms of resource utilization and operating frequency as compared to the conventional design.
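
The decomposition behind this abstract can be sketched in a few lines. This is a minimal software model under stated assumptions, not the authors' Verilog design: `mul_half` is a hypothetical stand-in for the single shared N/2-bit multiplier module (the abstract says it uses the Dadda algorithm; here Python's `*` takes its place), and `mul_divide_and_conquer` calls it four times in sequence instead of instantiating four copies. N is assumed to be even and the operands unsigned.

```python
def mul_half(a: int, b: int, half: int) -> int:
    """Stand-in for one N/2-bit by N/2-bit unsigned multiplier module."""
    assert 0 <= a < (1 << half) and 0 <= b < (1 << half)
    return a * b


def mul_divide_and_conquer(x: int, y: int, n: int) -> int:
    """Unsigned n-bit multiply built from four passes through mul_half."""
    assert n % 2 == 0, "this sketch assumes an even operand width"
    half = n // 2
    mask = (1 << half) - 1
    xl, xh = x & mask, x >> half  # low and high halves of x
    yl, yh = y & mask, y >> half  # low and high halves of y
    acc = 0
    # x*y = xl*yl + (xh*yl + xl*yh) * 2^half + xh*yh * 2^n
    for a, b, shift in ((xl, yl, 0), (xh, yl, half),
                        (xl, yh, half), (xh, yh, n)):
        acc += mul_half(a, b, half) << shift  # reuse the same module
    return acc
```

For example, `mul_divide_and_conquer(200, 123, 8)` multiplies two 8-bit values using only 4-bit products.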


“…For the i-th column of the adder, x_i and y_i are the bits of X and Y, respectively, c_i is the carry-in bit, c_{i+1} is the carry-out bit and s_i is the sum bit. The prop_i signal must be set to x_i ⊕ y_i and the gen_i signal can be set to either x_i or y_i to add x_i and y_i [14,16]. If x_i and y_i together are a function of five or fewer inputs, then the LUT6 can be configured as two LUT5s, generating either x_i or y_i at O5 and routing it to gen_i, and generating x_i ⊕ y_i at O6 to drive prop_i.…”

Section: Proposed Two-operand Adder (mentioning)

confidence: 99%
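
The carry-chain rule quoted above can be checked with a small bit-level model. This is a behavioural sketch only, assuming the usual carry-chain behaviour in which the sum bit is prop_i ⊕ c_i and the carry-out is c_i when prop_i is 1 and gen_i otherwise. The function name `carry_chain_add` is hypothetical, and the LUT6/LUT5 packing itself is not modelled.

```python
def carry_chain_add(x: int, y: int, width: int, cin: int = 0) -> tuple[int, int]:
    """Add two unsigned `width`-bit values column by column.

    Returns (sum, carry_out), where sum holds the low `width` bits.
    """
    c = cin
    s = 0
    for i in range(width):
        xi = (x >> i) & 1
        yi = (y >> i) & 1
        prop = xi ^ yi          # prop_i = x_i XOR y_i
        gen = xi                # gen_i may be x_i or y_i: they match when prop_i = 0
        s |= (prop ^ c) << i    # s_i = prop_i XOR c_i
        c = c if prop else gen  # c_{i+1}: pass the carry, or take gen_i
    return s, c
```

Choosing x_i or y_i for gen_i makes no difference, because gen_i is only selected when prop_i = 0, which means x_i and y_i are equal.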

“…This paper describes an approach that uses a novel two-operand addition circuit [14][15][16] that combines generation of a pre-computed partial product with addition of another value, similar to Wirthlin's work but optimized for Xilinx FPGAs with 6-input LUTs. A novel approach is used for the case where the constant is negative.…”

Section: Introduction (mentioning)

confidence: 99%

Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs

Walters

2017

Electronics

Self Cite


Multiplication by a constant is a common operation for many signal, image, and video processing applications that are implemented in field-programmable gate arrays (FPGAs). Constant-coefficient multipliers (KCMs) are often implemented in the logic fabric using lookup tables (LUTs), reserving embedded hard multipliers for general-purpose multiplication. This paper describes a two-operand addition circuit from previous work and shows how it can be used to generate and add pre-computed partial products to implement KCMs. A novel method for pre-computing partial products for KCMs with a negative constant is also presented. These KCMs are then extended to have two to eight coefficients that may be selected by a control signal at runtime to implement time-multiplexed multiple-constant multiplication. Synthesis results show that proposed pipelined KCMs use 27.4% fewer LUTs on average and have a median LUT-delay product that is 12% lower than comparable LogiCORE IP KCMs. Proposed pipelined KCMs with two to eight selectable coefficients use 46% to 70% fewer LUTs than the best LogiCORE IP based alternative and most are faster than using a LogiCORE IP multiplier with a coefficient lookup function. They also outperform the state-of-the-art in the literature, using 22% to 57% fewer slices than the smallest pipelined adder graph (PAG) fusion designs and operate 7% to 30% faster than the fastest PAG fusion designs for the same operand size and number of selectable coefficients. For KCMs and KCMs with selectable coefficients of a given operand size, the placement and routing of LUTs remains the same for all positive and negative constant values, which is advantageous for runtime partial reconfiguration.
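
The general idea of a LUT-based KCM, looking up pre-computed partial products and adding them, can be sketched as follows. This is a generic illustration under stated assumptions, not the circuit from the paper: the operand is treated as unsigned, the 4-bit chunk size is arbitrary, and the names `build_kcm_table` and `kcm_multiply` are hypothetical. The paper's two-operand adder and its method for negative constants are not reproduced here.

```python
def build_kcm_table(constant: int, chunk_bits: int = 4) -> list[int]:
    """Pre-compute constant * d for every possible chunk value d."""
    return [constant * d for d in range(1 << chunk_bits)]


def kcm_multiply(x: int, table: list[int], operand_bits: int,
                 chunk_bits: int = 4) -> int:
    """Multiply unsigned x by the table's constant using lookups and adds."""
    assert 0 <= x < (1 << operand_bits)
    mask = (1 << chunk_bits) - 1
    n_chunks = -(-operand_bits // chunk_bits)  # ceiling division
    acc = 0
    for j in range(n_chunks):
        chunk = (x >> (j * chunk_bits)) & mask
        acc += table[chunk] << (j * chunk_bits)  # weighted partial product
    return acc
```

A call such as `kcm_multiply(201, build_kcm_table(37), 8)` sums two looked-up partial products instead of performing a general multiplication.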

