2019
DOI: 10.1109/tcsi.2019.2907488
Xcel-RAM: Accelerating Binary Neural Networks in High-Throughput SRAM Compute Arrays

Abstract: Deep neural networks are a biologically inspired class of algorithms that have recently demonstrated state-of-the-art accuracy in large-scale classification and recognition tasks. Hardware acceleration of deep networks is of paramount importance to ensure their ubiquitous presence in future computing platforms. Indeed, a major landmark that enables efficient hardware accelerators for deep networks is the recent advances from the machine learning community that have demonstrated the viability of aggressively scal…

Cited by 84 publications (39 citation statements)
References 35 publications
“…A majority of current hardware implementations are variants of von-Neumann machines [1]. In such machines, the memory and computation blocks are separate.…”
Section: Efficiency Analysis
Confidence: 99%
“…This work performs 4096 operations per cycle and achieves a throughput of 278.2 GOPS; the 6T bit-cell is susceptible to write disturb. This problem can be addressed by adding more transistors or capacitors to the bit-cell, as in Xcel-RAM [20], XNOR-SRAM [22], and C3SRAM [23]. The Xcel-RAM approach [20] uses a 10T bit-cell and performs a binary MAC operation between a weight and an input stored in two different rows.…”
Section: Introduction
Confidence: 99%
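As a rough sanity check on the figures quoted above (the relationship between throughput, parallelism, and clock rate is an assumption here, not stated in the excerpt), 278.2 GOPS at 4096 operations per cycle implies a clock of roughly 68 MHz:

```python
# Implied clock frequency from the cited throughput figures.
# Assumes throughput = ops_per_cycle * clock (no stalls), which is a
# simplification for illustration only.
ops_per_cycle = 4096
throughput_ops = 278.2e9  # 278.2 GOPS

clock_mhz = throughput_ops / ops_per_cycle / 1e6
print(round(clock_mhz, 1))  # about 67.9 MHz
```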
“…In addition, BNNs replace expensive MACs with bitwise XNOR followed by population count (popcount) computations; XNOR followed by popcount is called XNOR-and-accumulate (XAC). Thus, BNNs are considered suitable for resource- and energy-constrained embedded systems compared to CNNs, reducing computational complexity as well as memory footprint with minimal degradation in accuracy (less than 10% [2]).…”
Section: Introduction
Confidence: 99%
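The XAC operation described above can be emulated in a few lines of Python. This is a minimal sketch, not the paper's circuit: ±1 weights and activations are packed into integer bit masks (bit = 1 encodes +1), and the dot product of two n-element ±1 vectors is recovered as n minus twice the popcount of their XOR (each XOR mismatch contributes −1 instead of +1). The `pack` and `xac` helper names are illustrative, not from the source.

```python
def pack(vec):
    """Pack a list of +/-1 values into an int; element i sits at bit i,
    with bit value 1 encoding +1 and 0 encoding -1."""
    return sum(1 << i for i, v in enumerate(vec) if v == +1)

def xac(w_bits, x_bits, n):
    """XNOR-and-accumulate: dot product of two n-element +/-1 vectors
    packed as ints. XOR marks mismatching positions; each mismatch turns
    a +1 contribution into -1, hence n - 2 * popcount(xor)."""
    mismatches = bin((w_bits ^ x_bits) & ((1 << n) - 1)).count("1")
    return n - 2 * mismatches

# Dot product of [+1, -1, +1, +1] and [+1, +1, -1, +1]:
# 1*1 + (-1)*1 + 1*(-1) + 1*1 = 0
print(xac(pack([+1, -1, +1, +1]), pack([+1, +1, -1, +1]), 4))  # prints 0
```

This bit-level formulation is what makes BNN inference attractive for in-memory SRAM compute: the per-column XNOR reduces to a cheap logic operation, and the accumulation reduces to a popcount.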
“…Thanks to both (1) and (2), CiM_SRAM (M3D_4L) had smaller subarrays, which eventually reduced the length of the routing wires between subarrays.…”
Confidence: 99%