2017
DOI: 10.1109/lca.2017.2656880

Low Complexity Multiply Accumulate Unit for Weight-Sharing Convolutional Neural Networks

Abstract: Convolutional Neural Networks (CNNs) are one of the most successful deep machine learning technologies for processing image, voice and video data. CNNs require large amounts of processing capacity and memory, which can exceed the resources of low-power mobile and embedded systems. Several hardware accelerator designs have been proposed for CNNs, typically containing large numbers of Multiply Accumulate (MAC) units. One approach to reducing data sizes and memory traffic in CNN accelerators is "weight sharing"…
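The central idea is compact enough to sketch in a few lines of Python. The snippet below is a minimal, hypothetical rendering of a parallel accumulate shared MAC (PASM), not the paper's hardware: because shared weights take only K distinct codebook values, activations can first be summed into K bins (additions only), and each bin total is multiplied by its weight exactly once. The function pasm_dot and all sample values are our own.

```python
def pasm_dot(activations, weight_indices, codebook):
    """Dot product under weight sharing: accumulate into one bin per
    codebook entry (adds only), then do a single multiply per entry."""
    bins = [0] * len(codebook)
    for a, idx in zip(activations, weight_indices):
        bins[idx] += a                        # accumulate phase: no multiplies
    return sum(b * w for b, w in zip(bins, codebook))  # K multiplies total

# Cross-check against a conventional per-element MAC loop.
acts = [3, -1, 4, 1, -5, 9]
idxs = [0, 2, 1, 2, 0, 1]    # codebook index of each shared weight
cb   = [0.5, -1.25, 2.0]     # K = 3 shared weight values (hypothetical)
conventional = sum(a * cb[i] for a, i in zip(acts, idxs))
assert abs(pasm_dot(acts, idxs, cb) - conventional) < 1e-9
```

For a dot product of length N with K shared weights, this trades N multiplies for N bin-indexed additions plus K multiplies, which is where the multiplier area and power savings come from.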


Cited by 28 publications (9 citation statements)
References 8 publications
“…Garland et al. [11] show that they can vary the bit-width of their parallel accumulate shared MAC (PASM) between 4-bit and 32-bit in an ASIC and maintain performance and accuracy while reducing the power and area of the multiplier. In their follow-up work, Garland et al. [12] show that PASM can be implemented on an FPGA with the bit-width varied between INT8 and 32-bit, saving significant energy with only a slight increase in latency and no change in classification accuracy.…”
Section: Related Work (mentioning, confidence: 99%)
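The bit-width scaling described above rests on ordinary uniform quantization. As a rough illustration only (this is not Garland et al.'s exact scheme; quantize_symmetric and the random sample weights are hypothetical), narrowing the multiplier operands from 8 to 4 bits coarsens the weight grid, while a wide 32-bit accumulator keeps the running sums exact:

```python
import numpy as np

def quantize_symmetric(w, bits):
    """Uniform symmetric quantization of weights to a signed `bits`-wide
    integer grid (illustrative sketch, not the PASM papers' exact scheme)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(w)) / qmax        # map the largest |weight| to qmax
    q = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int32)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal(8).astype(np.float32)
for bits in (4, 8):                          # INT4 vs INT8 multiplier operands
    q, scale = quantize_symmetric(w, bits)
    err = np.max(np.abs(q * scale - w))      # error grows as bits shrink
    print(f"INT{bits}: max abs error {err:.4f}")
```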
“…The 16-PAS-4-MAC also consumes 61% less leakage power, 70% less dynamic power and 70% less total power (Figure 10). More details can be found in our original paper [Garland and Gregg 2017].…”
Section: Evaluation of PASM as a Stand-Alone Unit (mentioning, confidence: 99%)
“…1c are not used during 16 × 16-bit MAC mode. In most of the non-vector MAC designs [6][7][8][18][19][20][21], the flexibility to perform multiple MAC operations is absent. For example, the second, third, and fourth quarters of Figs.…”
Section: Related Work (mentioning, confidence: 99%)
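What that flexibility means in practice is that one wide multiplier can serve several narrow operands at once. Below is a purely illustrative Python sketch of computing two 8-bit products with a single wide multiply; packed_dual_mul and SHIFT are hypothetical names, and real vector MACs achieve this with partitioned partial-product arrays rather than operand packing:

```python
SHIFT = 18                      # > 16 so the low product cannot spill over

def packed_dual_mul(a0, a1, b):
    """Compute a0*b and a1*b (all unsigned 8-bit) with one wide multiply."""
    packed = (a1 << SHIFT) | a0          # pack both multiplicands
    prod = packed * b                    # single wide multiplication
    lo = prod & ((1 << SHIFT) - 1)       # a0*b (fits in 16 bits < 2**18)
    hi = prod >> SHIFT                   # a1*b
    return lo, hi

assert packed_dual_mul(200, 13, 77) == (200 * 77, 13 * 77)
```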
“…In [20], the previous MAC result is added along with the sum and carry from the last carry-save stage of the Wallace tree multiplier. In [21], a memory-based conventional MAC is designed, where one of the operands is sent to the multiplier from the memory. In most of the above-mentioned existing n×n-bit vector MACs ([14–16]), the hardware utilisation is lower during (n/2)×(n/2)-bit or n×(n/2)-bit modes of operation.…”
Section: Introduction (mentioning, confidence: 99%)
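The design in [20] keeps the running accumulator in redundant carry-save form, so the sum and carry vectors from the multiplier's final carry-save stage are folded back in each cycle and the expensive carry-propagate addition is performed only once at the end. A minimal Python model of that idea follows; csa and mac_carry_save are our own hypothetical names, and x * y stands in for the Wallace tree's partial-product reduction:

```python
def csa(a, b, c):
    """3:2 carry-save adder: compress three operands into a sum word and a
    carry word, with no carry propagation across bit positions."""
    s = a ^ b ^ c
    cy = ((a & b) | (a & c) | (b & c)) << 1
    return s, cy

def mac_carry_save(x, y, acc_sum, acc_carry, width=32):
    """One MAC step that folds the redundant accumulator into the final
    carry-save stage instead of resolving it every cycle."""
    mask = (1 << width) - 1
    p = (x * y) & mask                  # stand-in for the Wallace-tree output
    s, c = csa(p, acc_sum, acc_carry)
    return s & mask, c & mask

acc_s, acc_c = 0, 0
for x, y in [(3, 7), (5, 11), (2, 13)]:
    acc_s, acc_c = mac_carry_save(x, y, acc_s, acc_c)
# the single carry-propagate addition, done once at the end
assert acc_s + acc_c == 3*7 + 5*11 + 2*13
```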