Training machine learning algorithms is a computationally intensive process that is frequently memory-bound due to repeated accesses to large training datasets. As a result, processor-centric systems (e.g., CPU, GPU) suffer from costly data movement between memory units and processing units, which consumes large amounts of energy and execution cycles. Memory-centric computing systems, i.e., computing systems with processing-in-memory (PIM) capabilities, can alleviate this data movement bottleneck.

Our goal is to understand the potential of modern general-purpose PIM architectures to accelerate machine learning training. To do so, we (1) implement several representative classic machine learning algorithms (namely, linear regression, logistic regression, decision tree, and K-Means clustering) on a real-world general-purpose PIM architecture, (2) rigorously evaluate and characterize them in terms of accuracy, performance, and scaling, and (3) compare them to their counterpart implementations on CPU and GPU. Our experimental evaluation on a real memory-centric computing system with more than 2500 PIM cores shows that general-purpose PIM architectures can greatly accelerate memory-bound machine learning workloads when the necessary operations and datatypes are natively supported by PIM hardware. For example, our PIM implementation of decision tree is 27× faster than a state-of-the-art CPU version on an 8-core Intel Xeon, and 1.34× faster than a state-of-the-art GPU version on an NVIDIA A100. Our K-Means clustering on PIM is 2.8× and 3.2× faster than state-of-the-art CPU and GPU versions, respectively.

To our knowledge, our work is the first to evaluate training of machine learning algorithms on a real-world general-purpose PIM architecture.
We conclude this paper with several key observations, takeaways, and recommendations that can inspire users of machine learning workloads, programmers of PIM architectures, and hardware designers and architects of future memory-centric computing systems.