Design and Evaluation of a Processing-in-Memory Architecture for the Smart Memory Cube

Azarkhish, Erfan; Rossi, Davide; Loi, Igor; Benini, Luca

doi:10.1007/978-3-319-30695-7_2

Cited by 39 publications

(45 citation statements)

References 18 publications


“…Year | Category | NMC capabilities
Sinuca [88] | 2015 | Cycle-Accurate | Yes
HMC-SIM [89] | 2016 | Cycle-Accurate | Limited
CasHMC [90] | 2016 | Cycle-Accurate | No
SMC [31] | 2016 | Cycle-Accurate | Yes
CLAPPS [82] | 2017 | Cycle-Accurate | Yes
Ramulator-PIM [91] | 2019 | Cycle-Accurate | Yes
2) Simulation Based Modeling allows to achieve more accurate performance numbers. Architects often resort to modeling the entire micro-architecture precisely.…”

Section: Simulator (mentioning)

confidence: 99%

Near-memory computing: Past, present, and future

Singh

Chelini

Corda

et al. 2019

Microprocessors and Microsystems


The conventional approach of moving data to the CPU for computation has become a significant performance bottleneck for emerging scale-out data-intensive applications due to their limited data reuse. At the same time, the advancement in 3D integration technologies has made the decade-old concept of coupling compute units close to the memory, called near-memory computing (NMC), more viable. Processing right at the "home" of data can significantly diminish the data movement problem of data-intensive applications. In this paper, we survey the prior art on NMC across various dimensions (architecture, applications, tools, etc.) and identify the key challenges and open issues with future research directions. We also provide a glimpse of our approach to near-memory computing that includes i) NMC specific microarchitecture independent application characterization ii) a compiler framework to offload the NMC kernels on our target NMC platform and iii) an analytical model to evaluate the potential of NMC.


“…Even though Conv-Nets are computation-intensive workloads and extremely high energy-efficiencies have been previously reported for their ASIC implementations [18] [19] [17], the scalability and energy-efficiency of modern ConvNets are ultimately bound by the main memory where their parameters and channels need to be stored (See subsection II-B). This makes them interesting candidates for near memory computation, offering them plenty of bandwidth at a lower cost and without much buffering compared to off-chip accelerators due to lower memory access latency (A consequence of the Little's law 1 [24]).…”

Section: Introduction (mentioning)

confidence: 99%

Neurostream: Scalable and Energy Efficient Deep Learning with Smart Memory Cubes

Azarkhish

Rossi

Loi

et al. 2018

IEEE Trans. Parallel Distrib. Syst.

Self Cite


High-performance computing systems are moving towards 2.5D and 3D memory hierarchies, based on High Bandwidth Memory (HBM) and Hybrid Memory Cube (HMC) to mitigate the main memory bottlenecks. This trend is also creating new opportunities to revisit near-memory computation. In this paper, we propose a flexible processor-in-memory (PIM) solution for scalable and energy-efficient execution of deep convolutional networks (ConvNets), one of the fastest-growing workloads for servers and high-end embedded systems. Our codesign approach consists of a network of Smart Memory Cubes (modular extensions to the standard HMC) each augmented with a many-core PIM platform called NeuroCluster. NeuroClusters have a modular design based on NeuroStream coprocessors (for Convolution-intensive computations) and general-purpose RISC-V cores. In addition, a DRAM-friendly tiling mechanism and a scalable computation paradigm are presented to efficiently harness this computational capability with a very low programming effort. NeuroCluster occupies only 8% of the total logic-base (LoB) die area in a standard HMC and achieves an average performance of 240 GFLOPS for complete execution of full-featured state-of-the-art (SoA) ConvNets within a power budget of 2.5 W. Overall 11 W is consumed in a single SMC device, with 22.5 GFLOPS/W energy-efficiency which is 3.5X better than the best GPU implementations in similar technologies. The minor increase in system-level power and the negligible area increase make our PIM system a cost-effective and energy efficient solution, easily scalable to 955 GFLOPS with a small network of just four SMCs.


“…Within this organization, the processor can be implemented as some sophisticated standard superscalar processor and may contain a vector unit, as is the case with the intelligent RAM (IRAM), [17]. The integrated memory into the chip can be realized as SRAM or embedded DRAM, which is basically accessed through the processor's cache memory, [23]. Considering that the processor and the memory are physically close, the integrated chip can achieve higher memory bandwidth, reduced memory latency and decreased power consumption, compared to today's conventional memory chips, and cache memories in multi-processing systems, [12].…”

Section: Comparative Analysis (mentioning)

confidence: 99%

“…Other scientists have suggested innovations into DRAM memory architecture itself, [31]. This research has resulted with several DRAM solutions, including: asymmetric DRAM (provides non-uniform access to DRAM banks), Reduced Latency DRAM (RLDRAM), Fast Cycle DRAM (FCRAM divides each row in several sub-rows), SALP systems (Subarray-Level Parallelism System allows overlapping of different components of the bank access latencies of multiple requests that go to different subarrays within the same bank), Enhanced DRAM and Virtual Channel DRAM add a SRAM buffer to DRAM memory in order to cache the mostly accessed data, Tiered-Latency DRAM (TL-DRAM uses shorter bit lines with fewer cells), hybrid memory cube (places several memory modules dies on top of each other in a 3D cube shape) and embedded DRAM (eDRAM is integrated on the same chip die with the processor), [23], [32]- [35].…”

Section: Overview Of Techniques For Improving Memory Latency In Proce (mentioning)

confidence: 99%

“…This research resulted in creating a variety of memories that include processing capabilities, known as: computational RAM, intelligent RAM, processing in memory chips, intelligent memory systems, [13]- [23] etc. These smart chips usually integrate the processing elements into the DRAM memory, instead of extending the SRAM processor memory, basically because DRAM memory is characterized with higher density and lower price, [1], comparing to SRAM memory.…”

Section: Introduction (mentioning)

confidence: 99%
