Locality-Driven Dynamic GPU Cache Bypassing

Li, Chao; Song, Shuaiwen Leon; Dai, Hongwen; Sidelnik, Albert; Hari, Siva Kumar Sastry; Zhou, Huiyang

doi:10.1145/2751205.2751237

Cited by 104 publications

(46 citation statements)

References 30 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Besides, we enhance the baseline L1D and L2 caches with a XOR-based set index hashing technique [26], making it close to the real GPU device's configuration. Subsequently, we implement seven different warp schedulers: (1) GTO (GTO scheduler with set-index hashing [26]); (2) CCWS; (3) Best-SWL (best static wavefront limiting); (4) statPCAL (representative implementation of bypass scheme [27] that performs similar or better than [6], [28]); (5) CIAO-P (CIAO with only redirecting memory requests of interfering warp to shared memory); (6) CIAO-T (CIAO with only selective warp throttling); and (7) CIAO-C (CIAO with both CIAO-T and CIAO-P). Note that CCWS, Best-SWL, and CIAO-P/T/C leverage GTO to decide the order of execution of warps.…”

Section: A Methodologymentioning

confidence: 99%

CIAO: Cache Interference-Aware Throughput-Oriented Architecture and Scheduling for GPUs

Zhang

Gao

Kim

2018

2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

View full text Add to dashboard Cite

A modern GPU aims to simultaneously execute more warps for higher Thread-Level Parallelism (TLP) and performance. When generating many memory requests, however, warps contend for limited cache space and thrash cache, which in turn severely degrades performance. To reduce such cache thrashing, we may adopt cache locality-aware warp scheduling which gives higher execution priority to warps with higher potential of data locality. However, we observe that warps with high potential of data locality often incurs far more cache thrashing or interference than warps with low potential of data locality. Consequently, cache locality-aware warp scheduling may undesirably increase cache interference and/or unnecessarily decrease TLP.In this paper, we propose Cache Interference-Aware throughput-Oriented (CIAO) on-chip memory architecture and warp scheduling which exploit unused shared memory space and take insight opposite to cache locality-aware warp scheduling. Specifically, CIAO on-chip memory architecture can adaptively redirect memory requests of severely interfering warps to unused shared memory space to isolate memory requests of these interfering warps from those of interfered warps. If these interfering warps still incur severe cache interference, CIAO warp scheduling then begins to selectively throttle execution of these interfering warps. Our experiment shows that CIAO can offer 54% higher performance than prior cache locality-aware scheduling at a small chip cost.

show abstract

Section: A Methodologymentioning

confidence: 99%

CIAO: Cache Interference-Aware Throughput-Oriented Architecture and Scheduling for GPUs

Zhang

Gao

Kim

2018

2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

View full text Add to dashboard Cite

show abstract

“…In the worst case, due to the lack of L2 cache capacity, it is sometimes necessary to load the evicted data from the off-chip memory. 6,31,[33][34][35][36][37][38][39][40][41] Shared memory is an alternative to the L1 cache for storing preloaded data. There are several reasons to support this.…”

Section: Preloading In the Shared Memorymentioning

confidence: 99%

“…As many previous research studies have shown, effectively hiding cache resource contention is a crucial step to achieving high performance on GPUs. 6,31,[33][34][35][36][37][38][39][40][41]43 Previous studies of resolving the resource contention problems are based on dynamic analysis methods that require hardware modification. In addition to preloading in shared memory efficiently, it is necessary to combine static analysis to avoid the L1 cache from the resource contentions effectively.…”

Section: Impact Of Various Preload Factorsmentioning

confidence: 99%

Static code transformations for thread‐dense memory accesses in GPU computing

Kim

Hong

Park

et al. 2019

Concurrency and Computation

View full text Add to dashboard Cite

Due to the GPU's complex memory system and massive thread-level parallelism, application programmers often have difficulty optimizing GPU programs. An essential approach to memory optimization is to utilize low-latency on-chip memory to avoid high latency of off-chip memory accesses. Shared memory is an on-chip memory, which is explicitly managed by programmers.Shared memory has a read/write latency similar to that of the L1 cache, but poor data management can degrade performance. In this paper, we present a static code transformation that preloads dataset in GPU's shared memory. Our static analysis primarily targets global memory requests with high thread-density for preloading in shared memory. The thread-dense memory access pattern is a pattern in which many threads efficiently manage the address space of shared memory, as well as reuse the same data in a thread block. We limit the usage of shared memory so that thread-level parallelism remains at the same level when selecting datasets for preloading. Finally, our source-to-source compiler allows to preload selected datasets in shared memory by transforming non-optimized GPU kernel code. Our methods achieve 1.26× and 1.62× speedups on average (geometric mean), respectively with GTX980 and P100 GPUs. KEYWORDScode transformation, GPU computing, shared memory, static analysis INTRODUCTIONGraphics processing units (GPUs) are very useful in accelerating scientific applications and even in accelerating machine learning applications.To take advantage of GPU computing power in these applications, it is essential to reduce the high-latency off-chip memory access. Several memory access transformation methods have been explored to optimize off-chip memory accesses by utilizing low-latency on-chip memory. 1-8The use of shared memory, which is located in on-chip and explicitly managed by user-written kernel codes, is one way to avoid the high-latency overhead within off-chip memory access. 9-16 Despite the beneficial characteristics of shared memory, applications often leave out shared memory unused, mainly due to the extra management of address space in shared memory. Domain-specific programmers prefer using the hardware-managed L1 cache rather than shared memory for programming simplicity. 2,17 None of 13 applications in PolyBench benchmark and only 14 of 23 applications in Rodinia benchmarks use shared memory. 18,19Complex memory system and massive thread-level parallelism often make it prohibitively difficult for domain-specific application programmers to optimize memory access patterns in GPU computing. Furthermore, GPU architectures are evolving rapidly, which makes developers rewrite GPU kernels for different generations. To overcome these hurdles, studies of compiler-based optimization and analysis tools are carried out to support programmers with no in-depth knowledge of the GPU architecture. 5,6,11,[20][21][22][23][24] This paper proposes a static code transformation for preloading data in shared memory of GPUs. Our software-only approach focuses primarily on off...

show abstract

“…Xie et al [22] recently studied a coordinated static and dynamic cache bypassing, in which a subset of thread blocks is analyzed at runtime to bypass the LI cache. Li et al [23] proposed to use locality monitoring mechanism to dynamically bypass LI data caches for GPUs. Compared to all these studies that require nontrivial hardware support, our method is based on a considerably simpler hardware extension, which consists of a threshold register and a comparator.…”

Section: Related Workmentioning

confidence: 99%

Enhancing GPU Performance by Efficient Hardware-Based and Hybrid L1 Data Cache Bypassing

Huangfu¹,

Zhang²

2017

Journal of Computing Science and Engineering

View full text Add to dashboard Cite

Recent GPUs have adopted cache memory to benefit general-purpose GPU (GPGPU) programs. However, unlike CPU programs, GPGPU programs typically have considerably less temporal/spatial locality. Moreover, the L1 data cache is used by many threads that access a data size typically considerably larger than the L1 cache, making it critical to bypass L1 data cache intelligently to enhance GPU cache performance. In this paper, we examine GPU cache access behavior and propose a simple hardware-based GPU cache bypassing method that can be applied to GPU applications without recompiling programs. Moreover, we introduce a hybrid method that integrates static profiling information and hardware-based bypassing to further enhance performance. Our experimental results reveal that hardware-based cache bypassing can boost performance for most benchmarks, and the hybrid method can achieve performance comparable to state-of-the-art compiler-based bypassing with considerably less profiling cost.

show abstract

Locality-Driven Dynamic GPU Cache Bypassing

Cited by 104 publications

References 30 publications

CIAO: Cache Interference-Aware Throughput-Oriented Architecture and Scheduling for GPUs

CIAO: Cache Interference-Aware Throughput-Oriented Architecture and Scheduling for GPUs

Static code transformations for thread‐dense memory accesses in GPU computing

Enhancing GPU Performance by Efficient Hardware-Based and Hybrid L1 Data Cache Bypassing

Contact Info

Product

Resources

About