Proceedings of the 53rd Annual Design Automation Conference 2016
DOI: 10.1145/2897937.2897966
A model-driven approach to warp/thread-block level GPU cache bypassing

Abstract: The high volume of memory requests from massive numbers of threads can easily cause cache contention and cache-miss-related resource congestion on GPUs. This paper proposes a simple yet effective performance model to estimate the impact of cache contention and resource congestion as a function of the number of warps/thread blocks (TBs) that bypass the cache. We then design a hardware-based dynamic warp/thread-block level GPU cache bypassing scheme, which achieves a 1.68x speedup on average on a set of memory-intensive benchmarks…
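The core idea in the abstract — evaluate a performance model for each candidate number of bypassing warps, then pick the count that minimizes estimated cost — can be sketched as follows. The cost model below is purely illustrative (the latencies, hit-rate curve, and congestion penalty are invented assumptions, not the paper's actual model); only the selection structure mirrors the approach described.

```python
# Hypothetical sketch: score each candidate number of bypassing warps
# with a toy cost model, then choose the count with the lowest cost.
# All constants and the hit-rate/congestion formulas are assumptions.

def estimated_cost(n_bypass, n_warps, hit_latency=30, miss_latency=350,
                   mshr_capacity=32):
    """Toy cost: bypassing warps always pay miss latency; cached warps
    hit more often as contention drops, but excess outstanding misses
    beyond MSHR capacity add a congestion penalty."""
    n_cached = n_warps - n_bypass
    # Hit rate improves as fewer warps share the cache (illustrative curve).
    hit_rate = min(0.9, 0.2 + 0.7 * n_bypass / n_warps)
    cached_cost = n_cached * (hit_rate * hit_latency
                              + (1 - hit_rate) * miss_latency)
    # Outstanding misses past MSHR capacity model miss-related congestion.
    outstanding = n_bypass + n_cached * (1 - hit_rate)
    congestion = max(0.0, outstanding - mshr_capacity) * 20
    return cached_cost + n_bypass * miss_latency + congestion

def best_bypass_count(n_warps):
    """Pick the number of bypassing warps with the lowest estimated cost."""
    return min(range(n_warps + 1), key=lambda n: estimated_cost(n, n_warps))

print(best_bypass_count(48))
```

In the paper's hardware scheme this selection is made dynamically at run time; the sketch only shows the model-driven "choose how many warps/TBs bypass" decision in its simplest offline form.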

Cited by 17 publications (6 citation statements)
References 31 publications
“…As for GPGPU workloads, many more works have targeted cache locality to improve performance. There are many works in literature [9], [24], [29], [31], [32], [47], [50], [56], [57], [59], [61] that explore cache bypassing to improve GPU cache locality. Some works have targeted cache locality across kernel launches for parent-child kernels [52] or generic dependent kernels [16].…”
Section: Related Work
confidence: 99%
“…However, on the GPU [15]–[22], the model based on the cache hit rate does not always perform well due to the GPU's unique architectural characteristics, including massive parallelism, resource congestion, and memory divergence. A model-driven approach was developed by [23], which dynamically estimates the impact of cache contention and resource congestion as a function of the number of warps/thread blocks (TBs) that bypass the cache. Xie et al. [17] proposed a compiler-based method to access or bypass the cache by analyzing reuse distance and memory traffic.…”
Section: Related Work
confidence: 99%
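The reuse-distance analysis mentioned in the statement above can be illustrated with a short sketch: the (stack) reuse distance of an access is the number of distinct addresses touched since the previous access to the same address, and accesses whose reuse distance exceeds the cache capacity are natural bypass candidates. The function and trace below are hypothetical illustrations, not the cited authors' implementation.

```python
# Illustrative reuse-distance computation over an address trace.
# Loads whose reuse distance exceeds the cache capacity (in lines)
# could be flagged as bypass candidates by a compiler or profiler.

def reuse_distances(trace):
    last_seen = {}
    distances = []  # None marks first-time (cold) accesses
    for i, addr in enumerate(trace):
        if addr in last_seen:
            # Count distinct addresses accessed since the last use of addr.
            window = trace[last_seen[addr] + 1:i]
            distances.append(len(set(window)))
        else:
            distances.append(None)
        last_seen[addr] = i
    return distances

trace = ['A', 'B', 'C', 'A', 'B', 'D', 'A']
print(reuse_distances(trace))  # → [None, None, None, 2, 2, None, 2]
```

This quadratic formulation is the textbook definition; production tools typically use a tree or sampling structure to compute reuse distances in near-linear time.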
“…Prior work uses GPU modeling techniques to guide runtime optimizations (e.g., DVFS configuration [15] and cache miss-related optimizations [16]) or GPU resource scaling analysis [2]. Our work provides an accurate model for fast design space exploration.…”
Section: Related Work
confidence: 99%