A quantitative evaluation of unified memory in GPUs

Yu, Qi; Childers, Bruce R.; Huang, Libo; Cheng, Qi; Wang, Zhiying

doi:10.1007/s11227-019-03079-y

Cited by 11 publications

(12 citation statements)

References 32 publications


“…The difference in memory access patterns across benchmarks put the hardware prefetcher in different degrees of efficacy. More detailed discussion on UVM hardware prefetchers can be found in other papers such as [6], [9], [19]. Observation/Suggestion: The above results on the effective PCIe bandwidth indicate that hardware prefetchers that are currently employed in GPUs cannot fully utilize PCIe bandwidth.…”

Section: Effect of Data Migration on PCIe Bandwidth (mentioning)

confidence: 87%

“…Due to the large potential benefits of UVM and its associated performance issues, UVM has recently drawn significant attention from the research community. Several optimization techniques have been proposed to mitigate the side effects of UVM [5], [6], [8], [9], [12], [19], [20]. The earliest work is Zheng et al [20], which enables on-demand GPU memory and proposes prefetching techniques to improve UVM performance.…”

Section: Introduction (mentioning)

confidence: 99%

“…As the work predates the release of UVM, the developed on-demand memory APIs are quite different from the version in the current UVM practice. More recently, Ganguly et al [6], Yu et al [19] and Li et al [9] study prefetching and/or eviction techniques for UVM in more detail. However, their evaluation includes only benchmarks with limited number of access patterns, which makes it difficult to assess the effectiveness of their schemes on a broader range of benchmarks with diverse memory access patterns.…”

Section: Introduction (mentioning)

confidence: 99%


UVMBench: A Comprehensive Benchmark Suite for Researching Unified Virtual Memory in GPUs

Gu, Wu, Li, et al. 2020

Preprint


The recent introduction of Unified Virtual Memory (UVM) in GPUs offers a new programming model that allows GPUs and CPUs to share the same virtual memory space, shifts the complex memory management from programmers to GPU driver/hardware, and enables kernel execution even when memory is oversubscribed. Meanwhile, UVM may also incur considerable performance overhead due to the tracking and data migration along with the special handling of page faults and page table walk. As UVM is attracting significant attention from the research community to develop innovative solutions to these problems, in this paper, we propose a comprehensive UVM benchmark suite named UVMBench to facilitate future research on this important topic. The proposed UVMBench consists of 34 representative benchmarks from a wide range of application domains. The suite also features unified programming implementation and diverse memory access patterns across benchmarks, thus allowing thorough evaluation and comparison with current state-of-the-art. A set of experiments have been conducted on real GPUs to verify and analyze the benchmark suite behaviors under various scenarios.


“…If the shared TLB cannot find a entry as well, it asks GMMU for traversing the page table and finding the entry. (Figure 2) GMMU utilizes up to 64 page table walker threads to process concurrent requests from multiple SMs in parallel [10]. Once GMMU finds the mapping, it returns the mapping to the requesting L2, L1, and SM.…”

Section: B. Address Translation in GP-GPU (mentioning)

confidence: 99%

Demand MemCpy: Overlapping of Computation and Data Transfer for Heterogeneous Computing

2022


Heterogeneous computing relies on collaboration among different types of processors on shared data. In systems with discrete accelerators (e.g., GP-GPU), data sharing requires transferring a large amount of data between CPU and accelerator memories and can significantly increase the end-to-end execution time. This paper proposes a novel mechanism called Demand MemCpy (DMC) to hide the data sharing overheads. DMC copies data from host memory to accelerator memory based on demands at page granularity. It utilizes a hardware-only mechanism to fetch the requested page with a short latency and the background pre-copy to fetch related pages in advance. Our evaluation shows that DMC can reduce the end-to-end execution time of GP-GPU application by 25.4% on average by overlapping computation with data transfer and not transferring unused pages.
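The address-translation path described in the citation statement above (a miss in the shared TLB falls through to a GMMU page table walk, and the mapping is then returned to the requesting levels) can be illustrated with a small simulation. This is only a sketch under stated assumptions: the class name `TranslationPath`, the two dictionary-backed TLB levels, and the `walks` counter are hypothetical and not taken from the cited paper, and the model ignores TLB capacity, eviction, and the concurrent page table walker threads.

```python
from dataclasses import dataclass, field


@dataclass
class TranslationPath:
    """Toy model of a two-level TLB backed by a page table (hypothetical names)."""
    page_table: dict                                  # virtual page -> physical frame
    l1_tlb: dict = field(default_factory=dict)        # per-SM TLB (assumed level)
    shared_tlb: dict = field(default_factory=dict)    # shared TLB
    walks: int = 0                                    # number of page table walks

    def translate(self, vpage):
        """Return (frame, level that supplied the mapping)."""
        if vpage in self.l1_tlb:
            return self.l1_tlb[vpage], "l1_tlb"
        if vpage in self.shared_tlb:
            frame = self.shared_tlb[vpage]
            self.l1_tlb[vpage] = frame                # fill the closer level
            return frame, "shared_tlb"
        # Both TLB levels missed: fall through to a page table walk.
        self.walks += 1
        frame = self.page_table[vpage]                # KeyError stands in for a page fault
        self.shared_tlb[vpage] = frame                # mapping is returned to each level
        self.l1_tlb[vpage] = frame
        return frame, "gmmu_walk"


path = TranslationPath(page_table={0x10: 0xA0})
first = path.translate(0x10)    # misses both TLBs, walks the page table
second = path.translate(0x10)   # now served from the closest TLB
```

In this toy model the first access costs one walk and the repeat access costs none, which is the locality effect that TLBs rely on.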


“…a pool of managed memory is accessible from both CPUs and GPUs using a single pointer within a multi-GPU system [25], [26]. One of the most salient feature of Unified Memory is that the system automatically migrates data allocated in Unified Memory (using cudaMallocManaged API) between the host and device.…”

Section: SpTRSV With Unified Memory, A. Communication Through Unified M... (mentioning)

confidence: 99%

Fast and Scalable Sparse Triangular Solver for Multi-GPU Based HPC Architectures

Xie, Chen, Firoz, et al. 2021

50th International Conference on Parallel Processing


Designing efficient and scalable sparse linear algebra kernels on modern multi-GPU based HPC systems is a daunting task due to significant irregular memory references and workload imbalance across the GPUs. This is particularly the case for Sparse Triangular Solver (SpTRSV) which introduces additional two-dimensional computation dependencies among subsequent computation steps. Dependency information is exchanged and shared among GPUs, thus warrant for efficient memory allocation, data partitioning, and workload distribution as well as fine-grained communication and synchronization support. In this work, we demonstrate that directly adopting unified memory can adversely affect the performance of SpTRSV on multi-GPU architectures, despite linking via fast interconnect like NVLinks and NVSwitches. Alternatively, we employ the latest NVSHMEM technology based on Partitioned Global Address Space programming model to enable efficient fine-grained communication and drastic synchronization overhead reduction. Furthermore, to handle workload imbalance, we propose a malleable task-pool execution model which can further enhance the utilization of GPUs. By applying these techniques, our experiments on the NVIDIA multi-GPU supernode V100-DGX-1 and DGX-2 systems demonstrate that our design can achieve on average 3.53× (up to 9.86×) speedup on a DGX-1 system and 3.66 × (up to 9.64×) speedup on a DGX-2 system with 4-GPUs over the Unified-Memory design. The comprehensive sensitivity and scalability studies also show that the proposed zero-copy SpTRSV is able to fully utilize the computing and communication resources of the multi-GPU system.
