Mobile and edge devices have become common platforms for convolutional neural network (CNN) inference due to their superior privacy and service quality. To reduce the computational cost of convolution (CONV), recent CNN models adopt depth-wise CONV (DW-CONV) and Squeeze-and-Excitation (SE). However, existing area-efficient CNN accelerators are sub-optimal for these latest models because they were mainly optimized for compute-intensive standard CONV layers with abundant data reuse that can be pipelined with activation and normalization operations. In contrast, DW-CONV and SE are memory-intensive with limited data reuse. SE also strongly depends on its nearby CONV layers, making effective pipelining a daunting task. As a result, although DW-CONV and SE account for only 10% of the total operations, they become memory-bandwidth bound and spend more than 60% of the processing time on systolic-array-based accelerators.
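To make the compute- versus memory-intensive contrast concrete, the back-of-the-envelope sketch below compares the arithmetic intensity (MACs per byte of off-chip traffic) of a standard CONV layer and a DW-CONV layer. The layer shape, 8-bit data, and the assumption that each tensor crosses the memory interface exactly once are illustrative choices made here, not figures taken from the paper.

```python
# Illustrative arithmetic-intensity comparison (assumed layer shape, 1 byte/element,
# each tensor read/written from off-chip memory exactly once).

def conv_intensity(h, w, c_in, c_out, k, bytes_per_elem=1, depthwise=False):
    """Return (MACs, bytes moved, MACs per byte) for one CONV layer."""
    if depthwise:
        assert c_in == c_out
        macs = h * w * c_in * k * k          # one k x k filter per channel
        weights = k * k * c_in
    else:
        macs = h * w * c_out * k * k * c_in  # every output sums over all input channels
        weights = k * k * c_in * c_out
    elems = (h * w * c_in) + weights + (h * w * c_out)
    bytes_moved = elems * bytes_per_elem
    return macs, bytes_moved, macs / bytes_moved

std = conv_intensity(56, 56, 128, 128, 3)
dw  = conv_intensity(56, 56, 128, 128, 3, depthwise=True)
print(f"standard CONV : {std[2]:7.1f} MACs/byte")   # high data reuse -> compute-bound
print(f"depth-wise CONV: {dw[2]:7.1f} MACs/byte")   # little reuse    -> bandwidth-bound
```

Under these assumptions the standard CONV performs hundreds of MACs per byte moved, while the DW-CONV performs only a handful, which is why the latter saturates memory bandwidth long before it saturates the arithmetic units.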
We propose a CNN acceleration architecture called MVP, which efficiently processes both compute- and memory-intensive operations with a small area overhead on top of the baseline systolic-array-based architecture. We introduce a specialized vector unit tailored to DW-CONV, comprising multipliers, adder trees, and multi-banked buffers, to meet its high memory bandwidth requirement. We also augment the unified buffer with tiny processing elements to smoothly pipeline SE with the subsequent CONV, enabling concurrent processing of DW-CONV and standard CONV and thereby maximizing the utilization of the arithmetic units. Our evaluation shows that MVP improves performance by 2.6× and reduces energy by 47% on average for EfficientNet-B0/B4/B7, MnasNet, and MobileNet-V1/V2, with only a 9% area overhead compared to the baseline.
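For readers unfamiliar with SE, the minimal numpy sketch below shows the standard SE block (as defined in the original Squeeze-and-Excitation paper, not MVP's hardware mapping) and why it resists pipelining: the "squeeze" step needs the entire output of the preceding CONV before any channel-wise scaling can start. The shapes and reduction ratio are assumed for illustration.

```python
# Minimal SE block sketch (standard formulation; shapes and ratio r are assumptions).
import numpy as np

def se_block(x, w1, w2):
    """x: (H, W, C) output of the preceding CONV; w1: (C, C//r); w2: (C//r, C)."""
    s = x.mean(axis=(0, 1))                  # squeeze: global average pool -> (C,)
    e = np.maximum(s @ w1, 0.0)              # excitation: FC + ReLU -> (C//r,)
    gate = 1.0 / (1.0 + np.exp(-(e @ w2)))   # FC + sigmoid -> per-channel gate (C,)
    return x * gate                          # scale: recalibrate every channel

H, W, C, r = 14, 14, 64, 4
x  = np.random.rand(H, W, C).astype(np.float32)
w1 = np.random.rand(C, C // r).astype(np.float32)
w2 = np.random.rand(C // r, C).astype(np.float32)
y = se_block(x, w1, w2)   # few MACs, but the whole (H, W, C) map is read and rewritten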