A High-Speed, Energy-Efficient Two-Cycle Multiply-Accumulate (MAC) Architecture and Its Application to a Double-Throughput MAC Unit

Hoang, Tung T.; Själander, Magnus; Larsson-Edefors, Per

doi:10.1109/tcsi.2010.2091191

Cited by 50 publications

(38 citation statements)

References 20 publications

“…The area of a PCM cell with an access transistor is ∼25 F² (where F corresponds to the minimum lithographic pitch in a technology node), which could be reduced to ∼6 F² with a suitable diode based access device [28]. On the other hand, one bit SRAM area is ≥120 F² and the area of a 16-bit multiply-accumulate (MAC) required for neural network architectures is at least three orders of magnitude higher [28,29]. This results in trade-offs between the number of parallel computing units and on-chip memory for hardware implementations of neural networks using conventional CMOS technology.…”

Section: Discussion (mentioning)

confidence: 99%

A phase-change memory model for neuromorphic computing

Nandakumar

Gallo

Boybat

et al. 2018

Journal of Applied Physics

133

Phase-change memory (PCM) is an emerging non-volatile memory technology that is based on the reversible and rapid phase transition between the amorphous and crystalline phases of certain phase-change materials. The ability to alter the conductance levels in a controllable way makes PCM devices particularly well-suited for synaptic realizations in neuromorphic computing. A key attribute that enables this application is the progressive crystallization of the phase-change material and subsequent increase in device conductance by the successive application of appropriate electrical pulses. There is significant inter- and intra-device randomness associated with this cumulative conductance evolution and it is essential to develop a statistical model to capture this. PCM also exhibits a temporal evolution of the conductance values (drift) which could also influence applications in neuromorphic computing. In this paper, we have developed a statistical model that describes both the cumulative conductance evolution and conductance drift. This model is based on extensive characterization work on 10,000 memory devices. Finally, the model is used to simulate supervised training of both spiking and non-spiking artificial neuronal networks.

“…The twin precision [Sjalander and Larsson-Edefors 2009] technique is used to optimize an n-bit multiplier, where the n-bit multiplier is used to compute two n/2-bit multiplications in parallel. A MAC architecture using twin precision multiplication is proposed in Hoang et al. [2010]. A new MAC architecture using a radix-4 modified Booth algorithm for fixed point is proposed in Seo and Kim [2010].…”

Section: MAC Design From the Literature (mentioning)

confidence: 99%
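
The statement above says that one n-bit multiplier can compute two n/2-bit multiplications in parallel. A minimal behavioural sketch of that idea follows, assuming unsigned operands and a plain AND-array of partial products. The function names `array_multiply` and `twin_multiply` are hypothetical, and this is not the circuit from the cited papers: it only illustrates that forcing the cross-term partial products to zero leaves two independent half-width products in the two halves of the result.

```python
def array_multiply(a, b, n=16, twin=False):
    """Sum the n*n AND-array partial products of two unsigned n-bit operands.

    With twin=True, partial products that pair a low-half bit with a
    high-half bit are forced to zero (an assumption of this sketch).
    """
    half = n // 2
    total = 0
    for i in range(n):              # bit index into multiplier b
        for j in range(n):          # bit index into multiplicand a
            if twin and (i < half) != (j < half):
                continue            # cross term gated off in twin mode
            total += (((a >> j) & 1) & ((b >> i) & 1)) << (i + j)
    return total


def twin_multiply(a_hi, a_lo, b_hi, b_lo, n=16):
    """Pack two n/2-bit operand pairs, multiply once, unpack both products."""
    half = n // 2
    a = (a_hi << half) | a_lo
    b = (b_hi << half) | b_lo
    packed = array_multiply(a, b, n, twin=True)
    # High product sits in bits [n, 2n), low product in bits [0, n).
    return packed >> n, packed & ((1 << n) - 1)


if __name__ == "__main__":
    print(twin_multiply(17, 200, 250, 123))   # two 8-bit products from one 16-bit array
    print(array_multiply(1234, 5678))         # ordinary full-width product
```

Because each half-width product fits in n bits, the two results cannot overlap in the 2n-bit output, which is why a single pass through the array is enough in this model.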

An Efficient Hardware-Based Higher Radix Floating Point MAC Design

2014

ACM Trans. Des. Autom. Electron. Syst.

This article proposes an effective way of implementing a multiply accumulate circuit (MAC) for high-speed floating point arithmetic operations. The real-world applications related to digital signal processing and the like demand high-performance computation with greater accuracy. In general, digital signals are represented as a sequence of signed/unsigned fixed/floating point numbers. The final result of a MAC operation can be computed by feeding the mantissa of the previous MAC result as one of the partial products to a Wallace tree multiplier or Braun multiplier. Thus, the separate accumulation circuit can be avoided by keeping the circuit depth still within the bounds of the Wallace tree multiplier, namely O(log₂ n), or Braun multiplier, namely O(n). In this article, three kinds of floating point MACs are proposed. The experimental results show 48.54% of improvement in worst path delay achieved by the proposed floating point MAC using a radix-2 Wallace structure compared with a conventional floating point MAC without a pipeline using a 45nm technology library. The same proposed design gives 39.92% of improvement in worst path delay without a pipeline using a radix-4 Braun structure as compared with a conventional design. In this article, a radix-32 Q32.32-format-based floating point MAC is proposed using a Wallace tree/Braun multiplier. Also this article discusses the msb prediction problem and its solution in floating point arithmetic that is not available in modern fused multiply-add designs. The performance results show comparisons between the proposed floating point MAC with various floating point MAC designs for radix-2, -4, -8, and -16. The proposed design has lesser depth than a conventional floating point MAC as well as a lower area requirement than other ways of floating point MAC implementation, both with/without a pipeline.

ACM Reference Format: Mohamed Asan Basiri M and Noor Mahammad Sk. 2014. An efficient hardware-based higher radix floating point MAC design.
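
The abstract's claim that the previous MAC result can be fed in as one more partial product without leaving the O(log₂ n) depth bound can be checked with a small counting exercise. The sketch below assumes a tree built only from 3:2 compressors and counts how many compressor levels are needed to bring a given number of rows down to two. The function name `reduction_levels` is hypothetical, and the model ignores everything specific to floating point (mantissa alignment, normalization, rounding).

```python
def reduction_levels(rows):
    """Count 3:2-compressor levels needed to reduce `rows` operands to 2.

    At each level every full group of three rows becomes two rows
    (sum and carry); leftover rows pass through unchanged.
    """
    levels = 0
    while rows > 2:
        rows = 2 * (rows // 3) + rows % 3
        levels += 1
    return levels


if __name__ == "__main__":
    for n in (8, 16, 32):
        # n partial products alone versus n partial products plus one accumulator row
        print(n, reduction_levels(n), reduction_levels(n + 1))
```

In this model the extra accumulator row costs at most one additional compressor level, and for the row counts printed here it costs none, which is consistent with the abstract's argument that a separate accumulation stage can be avoided.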

“…The multiplier in the MAC unit uses the Baugh-Wooley multiplier algorithm to generate the partial products, and reduces and reorganizes the partial products based on the high-performance multiplier tree scheme [3]. After the partial products are generated, they are clocked into the first stage of the pipeline and then made available to the carry-save adder (CSA). The CSA sums the partial products with the value in one of five selected accumulation registers.…”

Section: Datapath Description (mentioning)

confidence: 99%
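
The datapath described in the statement above (Baugh-Wooley partial products, then carry-save addition together with an accumulator value) can be modelled in a few lines. This is a behavioural sketch under several assumptions, not the cited design: it uses the common modified Baugh-Wooley form with complemented sign-bit terms and two correction constants, a single accumulator instead of five selectable registers, a 2n-bit accumulator with no guard bits, and no pipelining. The names `baugh_wooley_rows`, `csa`, and `mac` are hypothetical.

```python
def baugh_wooley_rows(a, b, n):
    """Partial-product rows of an n-bit two's-complement multiply (mod 2**(2n))."""
    a &= (1 << n) - 1
    b &= (1 << n) - 1
    rows = []
    for i in range(n):                      # one row per multiplier bit b_i
        row = 0
        for j in range(n):
            bit = ((a >> j) & 1) & ((b >> i) & 1)
            if (i == n - 1) != (j == n - 1):
                bit ^= 1                    # terms with exactly one sign bit are complemented
            row |= bit << (i + j)
        rows.append(row)
    rows.append((1 << n) | (1 << (2 * n - 1)))   # correction constants
    return rows


def csa(x, y, z):
    """3:2 carry-save adder: x + y + z == sum_word + carry_word."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1


def mac(acc, a, b, n):
    """Return acc + a*b as a signed 2n-bit value, using CSA reduction only."""
    width = 2 * n
    mask = (1 << width) - 1
    rows = baugh_wooley_rows(a, b, n) + [acc & mask]   # accumulator is just another row
    while len(rows) > 2:
        s, c = csa(rows[0], rows[1], rows[2])
        rows = rows[3:] + [s & mask, c & mask]
    result = (rows[0] + rows[1]) & mask                # single final carry-propagate add
    return result - (1 << width) if result >> (width - 1) else result


if __name__ == "__main__":
    print(mac(0, -3, 5, 8))
    print(mac(100, -128, -128, 8))
```

Only the last step performs a carry-propagate addition; every earlier step keeps the running value as a sum word and a carry word, which is the point of placing the accumulator inside the carry-save stage.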

Digital pixel CMOS focal plane array with on-chip multiply accumulate units for low-latency image processing

et al. 2014

A digital pixel CMOS focal plane array has been developed to enable low latency implementations of image processing systems such as centroid trackers, Shack-Hartmann wavefront sensors, and Fitts correlation trackers through the use of in-pixel digital signal processing (DSP) and generic parallel pipelined multiply accumulate (MAC) units. Light intensity digitization occurs at the pixel level, enabling in-pixel DSP and noiseless data transfer from the pixel array to the peripheral processing units. The pipelined processing of row and column image data prior to off-chip readout reduces the required output bandwidth of the image sensor, thus reducing the latency of computations necessary to implement various image processing systems. Data volume reductions of over 80% lead to sub-10 µs latency for completing various tracking and sensor algorithms. This paper details the architecture of the pixel-processing imager (PPI) and presents some initial results from a prototype device fabricated in a standard 65nm CMOS process hybridized to a commercial off-the-shelf short-wave infrared (SWIR) detector array.
