Page Placement Strategies for GPUs within Heterogeneous Memory Systems

Agarwal, Neha; Nellans, David; Stephenson, Mark W.; O'Connor, Mike; Keckler, Stephen W.

doi:10.1145/2694344.2694381

Cited by 95 publications

(35 citation statements)

References 34 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…• BW-AWARE denotes a synergistic bandwidth-aware page placement policy in [2]. It places the GPU pages across the heterogeneous memory system, i.e., GDDR5 and DDR3 DRAM, and their memory bandwidth is shared across GPU pages.…”

Section: Simulation Platformmentioning

confidence: 99%

“…Because of discrepant bandwidth requirements, it is anticipated that GPU will be still physically attached with bandwidth-optimized DRAM, while CPU is attached with capacity-and cost-optimized DRAM. DRAMs of GPU and CPU share a unified memory address space [2]. In such heterogeneous cache coherent non-uniform memory access (CC-NUMA) system, a computing unit has different access delays to local and remote memories even it sees a unified address space.…”

Section: Introductionmentioning

confidence: 99%

“…We compare TEMP and TBAS with some representative thread scheduling techniques, including the cache-conscious wavefront scheduler (CCWS) [37], OWL [18] and the bandwidth-aware policy (BW-AWARE) [2]. We set CCWS as our baseline and integrate OWL, BW-AWARE and our techniques on top of CCWS.…”

mentioning

confidence: 99%

See 2 more Smart Citations

Thread Batching for High-performance Energy-efficient GPU Memory Design

Mao

Liu

et al. 2019

J. Emerg. Technol. Comput. Syst.

View full text Add to dashboard Cite

Massive multi-threading in GPU imposes tremendous pressure on memory subsystems. Due to rapid growth in thread-level parallelism of GPU and slowly improved peak memory bandwidth, memory becomes a bottleneck of GPU's performance and energy efficiency. In this work, we propose an integrated architectural scheme to optimize the memory accesses and therefore boost the performance and energy efficiency of GPU. Firstly, we propose a thread batch enabled memory partitioning (TEMP) to improve GPU memory access parallelism. In particular, TEMP groups multiple thread blocks that share the same set of pages into a thread batch and applies a page coloring mechanism to bound each stream multiprocessor (SM) to the dedicated memory banks. After that, TEMP dispatches the thread batch to an SM to ensure high-parallel memory-access streaming from the different thread blocks. Secondly, a thread batch-aware scheduling (TBAS) scheme is introduced to improve the GPU memory access locality and to reduce the contention on memory controllers and interconnection networks. Experimental results show that the integration of TEMP and TBAS can achieve up to 10.3% performance improvement and 11.3% DRAM energy reduction across diverse GPU applications. We also evaluate the performance interference of the mixed CPU+GPU workloads when they are run on a heterogeneous system that employs our proposed schemes. Our results show that a simple solution can effectively ensure the efficient execution of both GPU and CPU applications. pages. https://doi.org/000000x.000000x INTRODUCTIONThe use of Graphics Processing Units (GPUs) has been extended from fixed graphics acceleration to general purpose computing, including image processing, computer vision, machine learning, and scientific computing. GPU is widely employed in various platforms ranging from embedded systems to high-performance computing systems [38].GPU heavily relies on massive threading to achieve high throughput. However, it commonly incurs intensive memory accesses, which may limit the performance and energy efficiency of GPU [19] as the result of the high overhead of device memory access 1 . Though large-capacity and low-overhead cache have been adopted by GPU to alleviate the impact of inefficient memory accesses [1,25], the available cache per thread is far below the demand of most GPU applications [17]. The pressures on device memory, i.e., DRAMs, in GPU are still severe.Memory scheduling is one of the primary architectural techniques to improve memory efficiency as it is able to optimize the memory access parallelism and locality in multi-core systems [9,21,29,30,42]. However, the existing memory scheduling algorithms are usually associated with expensive implementation [23] and also insufficient to handle the intensive memory accesses in GPU [3,46].The memory partitioning (MP) based on operating system (OS) memory management is another viable approach to improve memory efficiency and reduce inter-thread memory interference. Memory partitioning generally divides memory resources and ass...

show abstract

Section: Simulation Platformmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

Thread Batching for High-performance Energy-efficient GPU Memory Design

Mao

Liu

et al. 2019

J. Emerg. Technol. Comput. Syst.

View full text Add to dashboard Cite

show abstract

“…There are many previous works [2,12,13,14] on heterogeneous memory systems where DRAM and NVM coexist. Kannan et al particularly allowed to manage DRAM and NVM in a single virtual space and studied how NVM can be used to perform tasks that require persistence attributes.…”

Section: Chapter 6 Related Workmentioning

confidence: 99%

“…However, they did not study how to maximize the bandwidth of different types of memory clusters on the system. The most similar previous work was that of [12]. This work studied the bandwidth-aware memory placement policy that considers the CPU and GPU bandwidths.…”

Section: Chapter 6 Related Workmentioning

confidence: 99%