General-purpose graphics processing units (GPUs) have become a promising solution for processing massive data by taking advantage of massive multithreading. Thanks to thread-level parallelism, GPU-accelerated applications improve overall system performance by up to 40 times compared to CPU-only architectures [1], [2]. However, data-intensive GPU applications often generate a large number of irregular data accesses, which results in cache thrashing and contention [11], [12]. Cache thrashing in turn introduces a large number of off-chip memory accesses, which not only wastes tremendous energy moving data between the on-chip cache and off-chip global memory, but also significantly limits system performance because of the many stalled load/store instructions [18], [21].

In this work, we redesign the shared last-level cache (LLC) of GPU devices by introducing non-volatile memory (NVM), which can address the cache thrashing issues with low energy consumption. Specifically, we investigate two architectural approaches: a 2D planar resistive random-access memory (RRAM), which serves as our baseline NVM-cache, and a 3D-stacked RRAM technology [14], [15]. Our baseline NVM-cache replaces the SRAM-based L2 cache with RRAM of similar area; a memory die consists of eight subarrays, each of which is a small memristor island constructed as a 512x512 matrix [13]. Since the feature size of an SRAM cell is around 125 F² [19], while that of an RRAM cell is around 4 F² [20], the NVM-cache can offer around 30x more storage capacity than the SRAM-based cache in the same area. To make our baseline NVM-cache denser, we propose a 3D-stacked NVM-cache, which stacks four memory layers, each with its own pre-decode logic [16], [17].

Based on our experimental analysis, we observe that the baseline NVM-cache can already minimize off-chip memory accesses for many modern GPU computing applications.
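The density and footprint figures above follow from simple per-cell arithmetic; the sketch below reproduces them, assuming the per-cell areas quoted in the text (~125 F² for SRAM [19], ~4 F² for RRAM [20]) and the four-layer stack, rather than any measured layout data.

```python
# Back-of-the-envelope comparison of SRAM- and RRAM-based caches.
# Cell areas are in units of F^2, where F is the process feature size;
# the constants come from the per-cell figures cited in the text.
SRAM_CELL_F2 = 125  # approximate SRAM cell area [19]
RRAM_CELL_F2 = 4    # approximate RRAM cell area [20]
LAYERS = 4          # memory layers in the 3D-stacked NVM-cache

# Capacity gain of a planar RRAM cache occupying the same die area
# as the SRAM-based L2 cache it replaces.
density_gain = SRAM_CELL_F2 / RRAM_CELL_F2  # 31.25, i.e., "around 30x"

# Stacking the same capacity across four layers shrinks the footprint
# to a quarter of the planar (baseline) NVM-cache.
footprint_fraction = 1 / LAYERS  # 0.25

print(f"planar RRAM vs. SRAM capacity gain: {density_gain:.2f}x")
print(f"3D-stacked footprint for equal capacity: {footprint_fraction:.2f} of planar")
```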
Therefore, in this work, we reduce the area of the LLC by employing our 3D-stacked NVM-cache architecture while still offering the same storage capacity as the baseline NVM-cache. As a result, the on-chip LLC with our 3D-stacked NVM-cache requires only 1/4 of the die area, which allows modern GPU systems to employ a larger L1D cache. Our two RRAM-integrated GPU architectures enable the system to perform data processing without sacrificing thread-level parallelism (imposed by thread block throttling on a streaming multiprocessor) and to address the issues caused by cache thrashing and off-chip memory accesses.

Figure 1: Performance improvement. (a) IPC. (b) Last-level cache misses.
Figure 2: Energy efficiency. (a) DRAM dynamic energy. (b) LLC energy.

To evaluate the performance of the SRAM-based LLC and our NVM-cache approaches, we modeled them with CACTI 6.5 [3] and NVSim 1.0 [4], respectively, and implemented both into GPGPU-Sim 3.2.2 [5]. For energy measurement, we also used GPUWattch [6]. In this preliminary evaluation, we selected ten applications from Polybench [7], Rodinia [8], Mars [9], and Parboil [10]. Fig. 1(a) shows that our baseline NVM-cache and 3D-stack...