Proceedings of the 29th ACM International Conference on Supercomputing (ICS 2015)
DOI: 10.1145/2751205.2751237

Locality-Driven Dynamic GPU Cache Bypassing

Abstract: This paper presents novel cache optimizations for massively parallel, throughput-oriented architectures like GPUs. L1 data caches (L1 D-caches) are critical resources for providing high-bandwidth and low-latency data accesses. However, the high number of simultaneous requests from single-instruction multiple-thread (SIMT) cores makes the limited capacity of L1 D-caches a performance and energy bottleneck, especially for memory-intensive applications. We observe that the memory access streams to L1 D-caches for …

Cited by 104 publications (46 citation statements)
References 30 publications
“…Besides, we enhance the baseline L1D and L2 caches with an XOR-based set index hashing technique [26], making it close to the real GPU device's configuration. Subsequently, we implement seven different warp schedulers: (1) GTO (GTO scheduler with set-index hashing [26]); (2) CCWS; (3) Best-SWL (best static wavefront limiting); (4) statPCAL (representative implementation of a bypass scheme [27] that performs similarly to or better than [6], [28]); (5) CIAO-P (CIAO with only redirecting memory requests of interfering warps to shared memory); (6) CIAO-T (CIAO with only selective warp throttling); and (7) CIAO-C (CIAO with both CIAO-T and CIAO-P). Note that CCWS, Best-SWL, and CIAO-P/T/C leverage GTO to decide the order of execution of warps.…”
Section: A. Methodology
confidence: 99%
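The XOR-based set-index hashing mentioned in this excerpt can be illustrated with a short sketch. All constants and function names below are my own assumptions for illustration, not taken from the cited simulator: the idea is simply that higher address bits are XORed into the conventional set-index bits, so power-of-two strides no longer collapse onto a single cache set.

```python
# Illustrative sketch of XOR-based set-index hashing (assumed constants).
LINE_BYTES = 128          # assumed GPU cache-line size
NUM_SETS = 32             # assumed number of L1D sets

def conventional_set(addr):
    # plain modulo set indexing
    return (addr // LINE_BYTES) % NUM_SETS

def hashed_set(addr):
    # XOR the next-higher address bits into the set-index bits
    block = addr // LINE_BYTES
    return (block ^ (block // NUM_SETS)) % NUM_SETS

# A power-of-two stride maps every access to the same set conventionally,
# while the XOR hash spreads the same stream across many sets.
stride = LINE_BYTES * NUM_SETS
addrs = [i * stride for i in range(16)]
conv_sets = {conventional_set(a) for a in addrs}
hash_sets = {hashed_set(a) for a in addrs}
print(len(conv_sets), len(hash_sets))  # → 1 16
```

With plain indexing the strided stream hits one set and thrashes it; the hash distributes the 16 accesses over 16 distinct sets.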
“…In the worst case, due to the lack of L2 cache capacity, it is sometimes necessary to load the evicted data from the off-chip memory. 6,31,33-41 Shared memory is an alternative to the L1 cache for storing preloaded data. There are several reasons to support this.…”
Section: Preloading in the Shared Memory
confidence: 99%
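The benefit of preloading data into shared memory rather than letting it contend for the L1, as this excerpt describes, can be shown with a toy model. The model below is my own construction, not from the cited work: two interleaved streams conflict in every set of a tiny direct-mapped "L1", and preloading one stream into a scratchpad removes the conflict misses for the other.

```python
# Toy contention model (assumed sizes): a tiny direct-mapped L1 thrashes
# when two interleaved streams collide; serving one stream from a
# scratchpad (shared memory) leaves the L1 to the other stream.
NUM_SETS = 4

def run(stream, scratch=frozenset()):
    tags = [None] * NUM_SETS
    hits = 0
    for block in stream:
        if block in scratch:      # served from shared memory, no L1 traffic
            hits += 1
            continue
        s = block % NUM_SETS
        if tags[s] == block:
            hits += 1
        else:
            tags[s] = block       # cold or conflict miss: fill the set
    return hits

# Two streams whose blocks collide pairwise in every set.
a = [0, 1, 2, 3] * 8
b = [4, 5, 6, 7] * 8
interleaved = [x for pair in zip(a, b) for x in pair]
print(run(interleaved))                        # every access misses
print(run(interleaved, scratch={4, 5, 6, 7}))  # stream b preloaded
```

Without preloading, the two streams evict each other on every access (0 hits out of 64); with stream `b` held in the scratchpad, only the four cold misses of stream `a` remain.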
“…As many previous research studies have shown, effectively hiding cache resource contention is a crucial step to achieving high performance on GPUs. 6,31,33-41,43 Previous studies addressing the resource contention problem are based on dynamic analysis methods that require hardware modification. In addition to preloading into shared memory efficiently, it is necessary to combine static analysis to effectively shield the L1 cache from resource contention.…”
Section: Impact of Various Preload Factors
confidence: 99%
“…Xie et al [22] recently studied coordinated static and dynamic cache bypassing, in which a subset of thread blocks is analyzed at runtime to bypass the L1 cache. Li et al [23] proposed a locality-monitoring mechanism to dynamically bypass L1 data caches for GPUs. Compared to all these studies that require nontrivial hardware support, our method is based on a considerably simpler hardware extension, which consists of a threshold register and a comparator.…”
Section: Related Work
confidence: 99%
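The "threshold register and a comparator" extension this excerpt attributes to the citing work can be sketched in a few lines. This is a minimal model under my own assumptions (per-PC reuse counters, a single threshold value), not the cited design: loads whose observed reuse falls below the threshold register bypass the L1 and go straight to the lower cache level.

```python
# Minimal sketch of threshold-based bypassing (assumed structures):
# a per-load-PC reuse counter is compared against one threshold
# register; low-reuse loads bypass the L1 data cache.
from collections import defaultdict

THRESHOLD_REG = 2                  # assumed bypass threshold value
reuse_counter = defaultdict(int)   # assumed per-PC locality estimate

def should_bypass(pc):
    # the "comparator": bypass while observed reuse stays below threshold
    return reuse_counter[pc] < THRESHOLD_REG

def observe_reuse(pc):
    # called when a load's data is re-referenced while still resident
    reuse_counter[pc] += 1

pc = 0x400                 # hypothetical load instruction address
print(should_bypass(pc))   # no reuse observed yet, so bypass the L1
observe_reuse(pc)
observe_reuse(pc)
print(should_bypass(pc))   # reuse reached the threshold: cache in L1
```

The hardware cost this models is exactly what the excerpt claims: one register holding the threshold and one comparator per bypass decision, rather than the larger tracking structures of the dynamic schemes it compares against.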