Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture 2013
DOI: 10.1145/2540708.2540742

Efficient management of last-level caches in graphics processors for 3D scene rendering workloads

Abstract: Three-dimensional (3D) scene rendering is implemented as a pipeline in graphics processing units (GPUs). Different stages of the pipeline access different types of data, including vertex, depth, stencil, render target (i.e., pixel color), and texture sampler data. GPUs traditionally include small caches for vertex, render target, depth, and stencil data, as well as multi-level caches for the texture sampler units. Recent introduction of reasonably large last-level …

Cited by 22 publications (8 citation statements)
References: 42 publications
“…They contain a wide range of applications that fall into various research categories. The selected applications in Table 2 are also used in previous studies [4, 10, 16, 23–26, 28, 30–33, 42–45, 53–55].…”
Section: Discussion (mentioning)
confidence: 99%
“…NVIDIA provides its own tools to support profiling CUDA code, such as Visual Profiler (NVP) [16], nvprof [1], and NSight [12]. These profilers collect performance data via hardware performance counters and lightweight binary instrumentation.…”
Section: Related Work (mentioning)
confidence: 99%
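
A minimal sketch of the kind of profiling the statement above describes: running nvprof against a trivial CUDA kernel so the tool can read hardware performance counters. The kernel, file name, and exact metric names are illustrative assumptions (available metrics vary by GPU architecture and nvprof version; `nvprof --query-metrics` lists the valid ones).

// copy.cu -- toy kernel used only to have something to profile (assumed name)
#include <cuda_runtime.h>
#include <cstdio>

__global__ void copy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];   // one global load and one global store per thread
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    copy<<<(n + 255) / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    printf("done\n");
    return 0;
}

// Build and profile (metric names are assumptions for illustration):
//   nvcc -o copy copy.cu
//   nvprof --metrics gld_efficiency,l2_tex_read_hit_rate ./copy
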
“…It also employs a reactive bypassing scheme. Some work focuses on improving GPU cache performance through novel cache replacement methods [6, 7, 12, 31, 36, 37]. A decoupled GPU L1 cache is proposed in [16] to enable dynamic locality filtering functionality in the extended tag store for efficient and accurate runtime cache bypassing.…”
Section: Related Work (mentioning)
confidence: 99%
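
The works cited above bypass the cache dynamically at run time. A minimal sketch of the static analogue in CUDA follows, where individual loads are steered past the L1 data cache with the PTX .cg cache operator or routed through the read-only data path with __ldg(); the kernel and variable names are illustrative assumptions, not code from the cited papers.

#include <cuda_runtime.h>

// Load through L2 only, bypassing L1, via the PTX ld.global.cg cache operator.
__device__ float load_l2_only(const float *p) {
    float v;
    asm volatile("ld.global.cg.f32 %0, [%1];" : "=f"(v) : "l"(p));
    return v;
}

__global__ void scale(const float *in, float *out, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float streaming = load_l2_only(in + i);  // streaming data with no reuse: skip L1
    float reused    = __ldg(in + i);         // read-only data path, cached for reuse
    out[i] = k * (streaming + reused);
}

Here the choice of which loads bypass L1 is fixed at compile time; the cited schemes instead make that decision at run time based on observed locality.
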
“…A unified GPU on-chip memory design is proposed by Gebhart et al. [14] to satisfy varying capacity needs across different applications. LLC management policies for 3D scene rendering workloads on GPUs are explored by Gaur et al. [13], while our work focuses on general-purpose applications. Some other work studied cache management schemes for heterogeneous systems [27], [31].…”
Section: B. GPU Cache Management (mentioning)
confidence: 99%