Page Placement Strategies for GPUs within Heterogeneous Memory Systems

Agarwal, Neha; Nellans, David; Stephenson, Mark W.; O’Connor, Mike; Keckler, Stephen W.

doi:10.1145/2775054.2694381

Cited by 37 publications

(31 citation statements)

References 27 publications


“…To reduce the data movement cost, we selectively place some data objects in DRAM at the beginning of the application, instead of placing all data objects in NVM. The existing work has demonstrated performance benefit of the initial data placement on GPU with HMS [1,25]. Our initial data placement technique on NVM-based HMS is consistent with those existing efforts.…”

Section: Optimization (supporting)

confidence: 82%

“…
$$\mathrm{BW}_{\text{data obj}} = \frac{\#\text{data access} \times \text{cacheline size}}{\dfrac{\#\text{samples with data accesses}}{\#\text{samples}} \times \text{phase execution time}} \tag{1}$$
The numerator of Equation 1 is the accessed data size. #data access in the numerator is the number of memory accesses to the data object in main memory.…”

Section: Design (mentioning)

confidence: 99%
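
The bandwidth estimate quoted in Equation 1 above can be illustrated with a small worked example. This is a minimal sketch, not the citing paper's implementation: it assumes the denominator is the fraction of profiling samples that observed accesses to the data object multiplied by the phase execution time, and the function name, parameter names, and input numbers are hypothetical.

```python
def data_object_bandwidth(num_accesses, cacheline_bytes,
                          samples_with_access, total_samples,
                          phase_time_s):
    """Estimate the bandwidth (bytes/s) consumed by one data object.

    Assumed reading of Equation 1:
      numerator   = accessed data size = #accesses * cacheline size
      denominator = (samples with accesses / total samples) * phase time
    """
    if total_samples <= 0 or samples_with_access <= 0 or phase_time_s <= 0:
        raise ValueError("sample counts and phase time must be positive")
    if samples_with_access > total_samples:
        raise ValueError("samples_with_access cannot exceed total_samples")

    accessed_bytes = num_accesses * cacheline_bytes
    # Fraction of the phase during which the object was actually being accessed.
    active_fraction = samples_with_access / total_samples
    return accessed_bytes / (active_fraction * phase_time_s)


# Hypothetical inputs: 1,000,000 main-memory accesses of 64-byte cache lines,
# seen in 250 of 1,000 samples, during a phase lasting 0.5 s.
bw = data_object_bandwidth(1_000_000, 64, 250, 1_000, 0.5)
print(f"{bw / 1e6:.0f} MB/s")  # 64e6 bytes over 0.125 s of active time
```

Scaling the phase time by the fraction of samples with accesses means an object touched only briefly is credited with the bandwidth it demanded while active, rather than an average diluted over the whole phase.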

“…They introduce hardware modifications to support massive data migration and performance optimization. Agarwal et al. [1] introduce a bandwidth-aware data placement on GPU, driven by compiler extracted insights and explicit hints from programmers.…”

Section: Related Work (mentioning)

confidence: 99%


Runtime Data Management on Non-Volatile Memory-based Heterogeneous Memory for Task-Parallel Programs

Ren

Liu

2018

SC18: International Conference for High Performance Computing, Networking, Storage and Analysis


Non-volatile memory (NVM) provides a scalable and power-efficient solution to replace DRAM as main memory. However, because of relatively high latency and low bandwidth of NVM, NVM is often paired with DRAM to build a heterogeneous memory system (HMS). As a result, data objects of the application must be carefully placed to NVM and DRAM for best performance. In this paper, we introduce a lightweight runtime solution that automatically and transparently manage data placement on HMS without the requirement of hardware modifications and disruptive change to applications. Leveraging online profiling and performance models, the runtime characterizes memory access patterns associated with data objects, and minimizes unnecessary data movement. Our runtime solution effectively bridges the performance gap between NVM and DRAM. We demonstrate that using NVM to replace the majority of DRAM can be a feasible solution for future HPC systems with the assistance of a software-based data management.


“…
• Copy data from host memory to device (GPU) memory
• Launch the function (called kernel) to be executed on the GPU
• Wait until the kernel finishes
• Copy the output from device memory to host memory

In the real-time systems community, GPUs have been studied actively in recent years because of their potential benefits in accelerating demanding data-parallel real-time applications [5]. As observed in [6], GPU kernels typically demand high memory bandwidth to achieve high data parallelism and, if the memory bandwidth required by GPU kernels is not satisfied, it can result in significant performance reduction. For discrete GPUs, which have dedicated graphic memories, researchers have focused on addressing interference among the co-scheduled GPU tasks.…”

Section: Background and Related Work (mentioning)

confidence: 99%

Work-In-Progress: Protecting Real-Time GPU Applications on Integrated CPU-GPU SoC Platforms

Ali

Yun

2017

2017 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS)


Integrated CPU-GPU architecture provides excellent acceleration capabilities for data parallel applications on embedded platforms while meeting the size, weight and power (SWaP) requirements. However, sharing of main memory between CPU applications and GPU kernels can severely affect the execution of GPU kernels and diminish the performance gain provided by GPU. For example, in the NVIDIA Jetson TX2 platform, an integrated CPU-GPU architecture, we observed that, in the worst case, the GPU kernels can suffer as much as 3X slowdown in the presence of co-running memory intensive CPU applications. In this paper, we propose a software mechanism, which we call BWLOCK++, to protect the performance of GPU kernels from co-scheduled memory intensive CPU applications.


“…Similarly, Micron's Hybrid Memory Cube [4,5] and byte-addressable persistent memories [6-9] are quickly gaining traction. Vendors are combining these high-performance memories with traditional high-capacity and low-cost DRAM, prompting research on heterogeneous memory architectures [2, 9-15].…”

Section: Introduction (mentioning)

confidence: 99%

Hardware Translation Coherence for Virtualized Systems

Yan

Vesely

Cox

et al. 2017

Proceedings of the 44th Annual International Symposium on Computer Architecture


To improve system performance, modern operating systems (OSes) often undertake activities that require modification of virtual-to-physical page translation mappings. For example, the OS may migrate data between physical frames to defragment memory and enable superpages. The OS may migrate pages of data between heterogeneous memory devices. We refer to all such activities as page remappings. Unfortunately, page remappings are expensive. We show that translation coherence is a major culprit and that systems employing virtualization are especially badly affected by their overheads. In response, we propose hardware translation invalidation and coherence or HATRIC, a readily implementable hardware mechanism to piggyback translation coherence atop existing cache coherence protocols. We perform detailed studies using KVM-based virtualization, showing that HATRIC achieves up to 30% performance and 10% energy benefits, for per-CPU area overheads of 2%. We also quantify HATRIC's benefits on systems running Xen and find up to 33% performance improvements.
