General-purpose graphics processing units (GPUs) have become a promising solution for processing massive data by taking advantage of massive multithreading. Thanks to thread-level parallelism, GPU-accelerated applications improve overall system performance by up to 40 times compared to CPU-only architectures [1], [2]. However, data-intensive GPU applications often generate a large number of irregular data accesses, which results in cache thrashing and contention [11], [12]. Cache thrashing in turn introduces a large number of off-chip memory accesses, which not only wastes tremendous energy moving data between the on-chip cache and off-chip global memory, but also significantly limits system performance because of the many stalled load/store instructions [18], [21].

In this work, we redesign the shared last-level cache (LLC) of GPU devices by introducing non-volatile memory (NVM), which can address the cache thrashing issues with low energy consumption. Specifically, we investigate two architectural approaches: a 2D planar resistive random-access memory (RRAM), which serves as our baseline NVM-cache, and a 3D-stacked RRAM technology [14], [15]. Our baseline NVM-cache replaces the SRAM-based L2 cache with RRAM of similar area; a memory die consists of eight subarrays, each of which is a small memristor island constructed as a 512x512 matrix [13]. Since the feature size of an SRAM cell is around 125 F² [19], while that of an RRAM cell is around 4 F² [20], the NVM-cache can offer around 30x more storage capacity than the SRAM-based cache in the same area. To make our baseline NVM-cache denser, we propose a 3D-stacked NVM-cache, which stacks four memory layers, each with its own pre-decode logic [16], [17].

Based on our experimental analysis, we observe that the baseline NVM-cache can already minimize off-chip memory accesses for many modern GPU computing applications.
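The density and footprint figures above follow from simple per-cell arithmetic; the sketch below reproduces them, assuming the per-cell areas quoted in the text (~125 F² for SRAM [19], ~4 F² for RRAM [20]) and the four-layer stack, rather than any measured layout data.

```python
# Back-of-the-envelope comparison of SRAM- and RRAM-based caches.
# Cell areas are in units of F^2, where F is the process feature size;
# the constants come from the per-cell figures cited in the text.
SRAM_CELL_F2 = 125  # approximate SRAM cell area [19]
RRAM_CELL_F2 = 4    # approximate RRAM cell area [20]
LAYERS = 4          # memory layers in the 3D-stacked NVM-cache

# Capacity gain of a planar RRAM cache occupying the same die area
# as the SRAM-based L2 cache it replaces.
density_gain = SRAM_CELL_F2 / RRAM_CELL_F2  # 31.25, i.e., "around 30x"

# Stacking the same capacity across four layers shrinks the footprint
# to a quarter of the planar (baseline) NVM-cache.
footprint_fraction = 1 / LAYERS  # 0.25

print(f"planar RRAM vs. SRAM capacity gain: {density_gain:.2f}x")
print(f"3D-stacked footprint for equal capacity: {footprint_fraction:.2f} of planar")
```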
Therefore, in this work, we reduce the area of the LLC by employing our 3D-stacked NVM-cache architecture while still offering the same storage capacity as the baseline NVM-cache. As a result, the on-chip LLC with our 3D-stacked NVM-cache requires only 1/4 of the die area, which allows modern GPU systems to employ a larger L1D cache. Our two RRAM-integrated GPU architectures enable the system to perform data processing without sacrificing thread-level parallelism (imposed by thread block throttling on a streaming multiprocessor) and to address the issues caused by cache thrashing and off-chip memory accesses.

Figure 1: Performance improvement. (a) IPC. (b) Last-level cache misses.
Figure 2: Energy efficiency. (a) DRAM dynamic energy. (b) LLC energy.

To evaluate the performance of the SRAM-based LLC and our NVM-cache approaches, we modeled them with CACTI 6.5 [3] and NVSim 1.0 [4], respectively, and implemented both into GPGPU-Sim 3.2.2 [5]. For energy measurement, we also used GPUWattch [6]. In this preliminary evaluation, we selected ten applications from Polybench [7], Rodinia [8], Mars [9], and Parboil [10]. Fig. 1(a) shows that our baseline NVM-cache and 3D-stack...