2017
DOI: 10.1145/3093336.3037709

Locality-Aware CTA Clustering for Modern GPUs

Abstract: Cache is designed to exploit locality; however, the role of on-chip L1 data caches on modern GPUs is often awkward. The locality among global memory requests from different SMs (Streaming Multiprocessors) is predominantly harvested by the commonly-shared L2 with long access latency, while the in-core locality, which is crucial for performance delivery, is handled explicitly by user-controlled scratchpad memory. In this work, we disclose another type of data locality that has been long ignored but with performa…
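The distinction the abstract draws is between reuse captured implicitly by caches and reuse staged explicitly in scratchpad. Below is a minimal CUDA sketch, not from the paper (the kernel, its 256-thread block size, and the 3-point stencil are illustrative assumptions), of how in-core locality is typically handled through user-controlled scratchpad (__shared__) memory, while first-touch loads still travel the global-memory path served by L2:

```cuda
#include <cstdio>

// Illustrative 3-point stencil; assumes blockDim.x == 256 and n % 256 == 0.
__global__ void stencil3(const float* in, float* out, int n) {
    __shared__ float tile[256 + 2];  // scratchpad: explicit in-core locality
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;

    // First touch goes to global memory; cross-SM reuse of these lines is
    // only captured by the shared L2.
    if (g < n) tile[t] = in[g];
    if (threadIdx.x == 0)              tile[0]     = (g > 0)     ? in[g - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1) tile[t + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
    __syncthreads();

    // Neighbor reuse is now served from scratchpad, not from any cache.
    if (g < n)
        out[g] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0f;
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    stencil3<<<n / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("done\n");
    cudaFree(in); cudaFree(out);
    return 0;
}
```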

Cited by 15 publications (16 citation statements) · References 37 publications
“…In each round of calling smem-merge, synchronization between multiple thread blocks is necessary: after the execution of reg-sort and smem-merge in each block, intermediate results need to be synchronized across all cooperative blocks via the global memory [47]. After that, the partially sorted data will be repartitioned and assigned to each block using a partitioning method similar to that of smem-merge, and exploiting inter-block locality will in general further improve overall performance [27]. As shown in Fig.…”
Section: Methodology 3.1 Adaptive GPU Segsort Mechanism
confidence: 99%
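A hedged sketch of the round structure this statement describes, with reg_sort and smem_merge as placeholder device functions (the cited implementation's internals are not reproduced here): each block processes its partition, then all cooperative blocks synchronize their global-memory intermediate results before the data is repartitioned for the next round. Grid-wide synchronization below uses CUDA cooperative groups, which requires launching via cudaLaunchCooperativeKernel:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Placeholder stages; the actual reg-sort / smem-merge bodies are omitted.
__device__ void reg_sort(float* seg, int len)   { /* in-register sorting network */ }
__device__ void smem_merge(float* seg, int len) { /* shared-memory merge */ }

__global__ void segsort_rounds(float* data, int n, int rounds) {
    cg::grid_group grid = cg::this_grid();
    int per_block = n / gridDim.x;                  // assumes n divisible by gridDim.x
    for (int r = 0; r < rounds; ++r) {
        float* seg = data + blockIdx.x * per_block; // this block's current partition
        reg_sort(seg, per_block);
        smem_merge(seg, per_block);
        // Intermediate results live in global memory; every cooperative block
        // must observe them before repartitioning for round r+1.
        grid.sync();
        // (locality-aware repartitioning of 'data' would happen here)
    }
}
```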
“…Multiple prior CTA schedulers [3,38,62] have used different heuristics to exploit the locality across CTAs. However, they are not ideal [26,40,66], and the fundamental problem of cache line replication across private L1 caches remains. While the goal of these schedulers is to improve cache performance, our approach 1) is not dependent on any scheduling algorithm, 2) does not require any software support to determine private and shared data, and 3) not only reduces replication but can eliminate it.…”
Section: Related Work
confidence: 99%
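For intuition, a common software analogue of such locality heuristics is block-index swizzling: remapping linear CTA ids into small 2D clusters so that CTAs likely to share cache lines are numbered adjacently. The sketch below is illustrative only (the helper name, the cluster_w parameter, and the assumption that cluster_w divides both grid dimensions are mine), not any specific scheduler cited above:

```cuda
// Remap a linear CTA id into cluster_w x cluster_w tiles. Assumes a 1D launch
// covering a grid_w x grid_h tile space, both divisible by cluster_w.
__device__ void clustered_id(int cluster_w, int grid_w, int* bx, int* by) {
    int id          = blockIdx.x;             // linear CTA id
    int per_cluster = cluster_w * cluster_w;
    int cluster     = id / per_cluster;       // which cluster this CTA joins
    int within      = id % per_cluster;       // its position inside the cluster
    int per_row     = grid_w / cluster_w;     // clusters per tile row
    *bx = (cluster % per_row) * cluster_w + within % cluster_w;
    *by = (cluster / per_row) * cluster_w + within / cluster_w;
}

__global__ void tiled_kernel(float* out, int grid_w, int cluster_w) {
    int bx, by;
    clustered_id(cluster_w, grid_w, &bx, &by);
    // Index tiles with (bx, by) instead of raw blockIdx so that consecutively
    // scheduled CTAs touch adjacent tiles and can share cache lines.
    out[by * grid_w + bx] = (float)(bx + by);
}
```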
“…In particular, we focus on addressing the inefficiencies associated with GPUs' private local L1 caches. Specifically, because of the private nature of the L1 caches, the same cache lines can be requested by different cores, leading to high inter-core locality [15,23,33,40,41]. This data (cache line) replication reduces the effective aggregate capacity of the L1 caches across all cores, lowering their bandwidth utilization, as we will show in Section 2.…”
Section: Introduction
confidence: 99%
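A minimal sketch of the replication effect being described, under the assumption that global loads are cached in L1 on the target architecture (the kernel and its names are hypothetical): every CTA reads the same table, so each SM's private L1 ends up holding its own copy of the same cache lines.

```cuda
// Every CTA walks the same lookup table; with private L1 caches, each SM
// caches its own copy of these lines (replication), shrinking the effective
// aggregate L1 capacity across cores.
__global__ void replicated_reads(const float* __restrict__ table, float* out,
                                 int table_len, int n) {
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    if (g >= n) return;
    float acc = 0.0f;
    for (int i = 0; i < table_len; ++i)   // same lines requested by every core
        acc += table[i];
    out[g] = acc;
}
```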
“…To summarize, we argue that GPUs are a promising platform for the ALS workload when taking both performance and power consumption into account. In the future, we will further investigate the performance gap between platforms and push the factorization performance to the hardware limit (in particular on newer Intel Xeon Phi processors with on-package high-bandwidth memory [35,36], newer GPUs at the warp level [37,38], CTA level [39], and cache level [40], and other emergent accelerators such as Matrix-2000 [41]).…”
Section: Applying Optimizations
confidence: 99%