2020
DOI: 10.1109/access.2020.2992286
A High-Performance Multiply-Accumulate Unit by Integrating Additions and Accumulations Into Partial Product Reduction Process

Abstract: In this paper, we propose a low-power, high-speed pipelined multiply-accumulate (MAC) architecture. In a conventional MAC, carry propagations of additions (both the additions within multiplications and the additions in accumulations) often lead to large power consumption and long path delays. To resolve this problem, we integrate part of the additions into the partial product reduction (PPR) process. In the proposed MAC architecture, the addition and accumulation of higher-significance bits are not performed until the PP…
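The abstract's central idea — deferring expensive carry propagation during accumulation — can be illustrated with carry-save arithmetic. The sketch below is a minimal software analogy, not the paper's exact PPR-integrated design: products are folded into a redundant (sum, carry) pair with a 3:2 compressor, and a single carry-propagate addition happens only at the very end.

```python
# Carry-save accumulation sketch (illustrative analogy only; the paper's
# actual technique merges accumulation into the partial product reduction tree).

def carry_save_add(s, c, x):
    """3:2 compressor: fold (s, c, x) into a new (sum, carry) pair
    without propagating carries across bit positions."""
    new_s = s ^ c ^ x                            # bitwise sum, no carry ripple
    new_c = ((s & c) | (s & x) | (c & x)) << 1   # generated carries, shifted left
    return new_s, new_c

def mac(pairs):
    """Accumulate a*b over all pairs, keeping the accumulator in
    redundant carry-save form until one final carry-propagate add."""
    s, c = 0, 0
    for a, b in pairs:
        s, c = carry_save_add(s, c, a * b)
    return s + c  # the only carry-propagating addition

print(mac([(3, 4), (5, 6), (2, 7)]))  # 3*4 + 5*6 + 2*7 = 56
```

In hardware, each `carry_save_add` has constant depth regardless of operand width, which is why deferring the final carry-propagate adder shortens the critical path.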

Cited by 27 publications (24 citation statements) · References 30 publications
“…Note that the core of convolution operation is multiplication and accumulation. Therefore, in the SIMD architecture, multiply-accumulate (MAC) engines [28][29][30] are used to support convolution operations between input activations and kernel weights. No matter if a CNN is sparse or not, the compression format cannot be directly applied to the SIMD architecture; otherwise, irregularly distributed nonzero values will break the alignment of input activations and kernel weights.…”
Section: Related Work
confidence: 99%
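The statement above notes that convolution reduces to multiply-accumulate and that SIMD MAC engines rely on activations and weights staying aligned. A minimal 1-D convolution sketch (illustrative only, not the cited accelerators' datapath) makes both points concrete: each output is one MAC chain, and element `i + j` of the input must line up with weight `j`.

```python
# 1-D convolution as repeated multiply-accumulate (valid padding).
# Irregularly removing zeros from `activations` (as in a sparse compression
# format) would break the i+j alignment this inner loop depends on.

def conv1d(activations, weights):
    k = len(weights)
    out = []
    for i in range(len(activations) - k + 1):
        acc = 0
        for j in range(k):                      # one MAC per kernel weight
            acc += activations[i + j] * weights[j]
        out.append(acc)
    return out

print(conv1d([1, 2, 3, 4], [1, 0, -1]))  # [-2, -2]
```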
“…Hardware simulation was performed on an Artix xc7a200tffg1156-3 FPGA in Xilinx Vivado 18.3 using the VHDL hardware description language. The goal of the simulation was to compare the technical characteristics of the FIR DF implemented using known architectures in PNS [30] and in RNS [11,14] against the FIR DF using the proposed RNS architecture with different moduli sets. Table V shows the results of hardware simulation of the 15th-order FIR DF at different bit widths.…”
Section: Hardware Simulation of Digital Filters in the Residue Number System
confidence: 99%
“…Comparison with the known method [11] based on RNS with 4 moduli showed that the proposed method increases the frequency of the 15th-order FIR DF by 1.7-5.0 times and reduces the hardware cost of its implementation by 1.5-4.8 times, at the price of a 7%-30% increase in power consumption. With a 5-modulus RNS, the proposed method increases the frequency of the 15th-order FIR DF by 2.0-4.2 times and reduces hardware cost by 1.1-2.6 times, with a 7%-33% increase in power consumption compared to the known PNS-based method [30]. Comparison with the known method [11] based on RNS with 5 moduli showed that the proposed method increases the frequency of the 15th-order FIR DF by 1.6-4.4 times and reduces hardware cost by 1.8-2.5 times, with an 11%-41% increase in power consumption.…”
Section: Hardware Simulation of Digital Filters in the Residue Number System
confidence: 99%
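The excerpts above compare FIR filters built on the residue number system (RNS) against a positional number system (PNS). The appeal of RNS for MAC-heavy filters is that multiply-accumulate proceeds independently per modulus, with no carries crossing channels. The sketch below uses an illustrative moduli set {7, 11, 13} chosen for this example, not the moduli sets of the cited works.

```python
# RNS multiply-accumulate sketch with an illustrative moduli set.
from math import prod

MODULI = (7, 11, 13)  # pairwise coprime; dynamic range M = 7*11*13 = 1001

def to_rns(x):
    """Forward conversion: integer -> residue tuple."""
    return tuple(x % m for m in MODULI)

def rns_mac(acc, a, b):
    """Channel-wise MAC: each modulus is processed independently,
    so no carry ever propagates between channels."""
    return tuple((r + (a % m) * (b % m)) % m for r, m in zip(acc, MODULI))

def from_rns(res):
    """Reverse conversion via the Chinese Remainder Theorem."""
    M = prod(MODULI)
    x = 0
    for r, m in zip(res, MODULI):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)  # modular inverse (Python 3.8+)
    return x % M

acc = to_rns(0)
acc = rns_mac(acc, 3, 4)
acc = rns_mac(acc, 5, 6)
print(from_rns(acc))  # 3*4 + 5*6 = 42
```

The trade-off reported above follows the same shape: per-modulus channels are narrow and fast (higher frequency, lower adder cost), while the conversion and extra channels cost additional power.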
“…In recent years, researchers have published numerous works in this area [2][3][5][6][7][8][9][10][11][12][13][14][15][16][17][18][19][20][21]. In 2007, reference [22] proposed a high-throughput MAC architecture with an optimized area.…”
Section: Introduction to Multiply and Accumulate (MAC) Architecture
confidence: 99%