2017
DOI: 10.1145/3093336.3037709

Locality-Aware CTA Clustering for Modern GPUs

Abstract: Cache is designed to exploit locality; however, the role of on-chip L1 data caches on modern GPUs is often awkward. The locality among global memory requests from different SMs (Streaming Multiprocessors) is predominantly harvested by the commonly-shared L2 with long access latency, while the in-core locality, which is crucial for performance delivery, is handled explicitly by user-controlled scratchpad memory. In this work, we disclose another type of data locality that has been long ignored but with performa…
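The distinction the abstract draws is between reuse captured implicitly by caches and reuse staged explicitly in scratchpad. Below is a minimal CUDA sketch, not from the paper (the kernel, its 256-thread block size, and the 3-point stencil are illustrative assumptions), of how in-core locality is typically handled through user-controlled scratchpad (__shared__) memory, while first-touch loads still travel the global-memory path served by L2:

```cuda
#include <cstdio>

// Illustrative 3-point stencil; assumes blockDim.x == 256 and n % 256 == 0.
__global__ void stencil3(const float* in, float* out, int n) {
    __shared__ float tile[256 + 2];  // scratchpad: explicit in-core locality
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x + 1;

    // First touch goes to global memory; cross-SM reuse of these lines is
    // only captured by the shared L2.
    if (g < n) tile[t] = in[g];
    if (threadIdx.x == 0)              tile[0]     = (g > 0)     ? in[g - 1] : 0.0f;
    if (threadIdx.x == blockDim.x - 1) tile[t + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
    __syncthreads();

    // Neighbor reuse is now served from scratchpad, not from any cache.
    if (g < n)
        out[g] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0f;
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in,  n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    stencil3<<<n / 256, 256>>>(in, out, n);
    cudaDeviceSynchronize();
    printf("done\n");
    cudaFree(in); cudaFree(out);
    return 0;
}
```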

Cited by 15 publications (16 citation statements) · References 37 publications
“…In each round of calling smem-merge, synchronization between multiple thread blocks is necessary: after the execution of reg-sort and smem-merge in each block, intermediate results need to be synchronized across all cooperative blocks via the global memory [47]. After that, the partially sorted data will be repartitioned and assigned to each block using a partitioning method similar to that of smem-merge, and exploiting inter-block locality will in general further improve overall performance [27]. As shown in Fig.…”
Section: Methodology 3.1 Adaptive GPU Segsort Mechanism
confidence: 99%
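A hedged sketch of the round structure this statement describes, with reg_sort and smem_merge as placeholder device functions (the cited implementation's internals are not reproduced here): each block processes its partition, then all cooperative blocks synchronize their global-memory intermediate results before the data is repartitioned for the next round. Grid-wide synchronization below uses CUDA cooperative groups, which requires launching via cudaLaunchCooperativeKernel:

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Placeholder stages; the actual reg-sort / smem-merge bodies are omitted.
__device__ void reg_sort(float* seg, int len)   { /* in-register sorting network */ }
__device__ void smem_merge(float* seg, int len) { /* shared-memory merge */ }

__global__ void segsort_rounds(float* data, int n, int rounds) {
    cg::grid_group grid = cg::this_grid();
    int per_block = n / gridDim.x;                  // assumes n divisible by gridDim.x
    for (int r = 0; r < rounds; ++r) {
        float* seg = data + blockIdx.x * per_block; // this block's current partition
        reg_sort(seg, per_block);
        smem_merge(seg, per_block);
        // Intermediate results live in global memory; every cooperative block
        // must observe them before repartitioning for round r+1.
        grid.sync();
        // (locality-aware repartitioning of 'data' would happen here)
    }
}
```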
“…Multiple prior CTA schedulers [3,38,62] have used different heuristics to exploit the locality across CTAs. However, they are not ideal [26,40,66], and the fundamental problem of cache line replication across private L1 caches remains. While the goal of these schedulers is to improve cache performance, our approach 1) is not dependent on any scheduling algorithm, 2) does not require any software support to determine private and shared data, and 3) not only reduces replication but can eliminate it.…”
Section: Related Work
confidence: 99%
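For intuition, a common software analogue of such locality heuristics is block-index swizzling: remapping linear CTA ids into small 2D clusters so that CTAs likely to share cache lines are numbered adjacently. The sketch below is illustrative only (the helper name, the cluster_w parameter, and the assumption that cluster_w divides both grid dimensions are mine), not any specific scheduler cited above:

```cuda
// Remap a linear CTA id into cluster_w x cluster_w tiles. Assumes a 1D launch
// covering a grid_w x grid_h tile space, both divisible by cluster_w.
__device__ void clustered_id(int cluster_w, int grid_w, int* bx, int* by) {
    int id          = blockIdx.x;             // linear CTA id
    int per_cluster = cluster_w * cluster_w;
    int cluster     = id / per_cluster;       // which cluster this CTA joins
    int within      = id % per_cluster;       // its position inside the cluster
    int per_row     = grid_w / cluster_w;     // clusters per tile row
    *bx = (cluster % per_row) * cluster_w + within % cluster_w;
    *by = (cluster / per_row) * cluster_w + within / cluster_w;
}

__global__ void tiled_kernel(float* out, int grid_w, int cluster_w) {
    int bx, by;
    clustered_id(cluster_w, grid_w, &bx, &by);
    // Index tiles with (bx, by) instead of raw blockIdx so that consecutively
    // scheduled CTAs touch adjacent tiles and can share cache lines.
    out[by * grid_w + bx] = (float)(bx + by);
}
```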
“…In particular, we focus on addressing the inefficiencies associated with GPUs' private local L1 caches. Specifically, because of the private nature of the L1 caches, the same cache lines can be requested by different cores, leading to high inter-core locality [15,23,33,40,41]. This data (cache line) replication reduces the effective aggregate capacity of the L1 caches across all cores, lowering their bandwidth utilization, as we will show in Section 2.…”
Section: Introduction
confidence: 99%
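A minimal sketch of the replication effect being described, under the assumption that global loads are cached in L1 on the target architecture (the kernel and its names are hypothetical): every CTA reads the same table, so each SM's private L1 ends up holding its own copy of the same cache lines.

```cuda
// Every CTA walks the same lookup table; with private L1 caches, each SM
// caches its own copy of these lines (replication), shrinking the effective
// aggregate L1 capacity across cores.
__global__ void replicated_reads(const float* __restrict__ table, float* out,
                                 int table_len, int n) {
    int g = blockIdx.x * blockDim.x + threadIdx.x;
    if (g >= n) return;
    float acc = 0.0f;
    for (int i = 0; i < table_len; ++i)   // same lines requested by every core
        acc += table[i];
    out[g] = acc;
}
```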
“…To summarize, we argue that GPUs are a promising platform for the ALS workload when taking both performance and power consumption into account. In the future, we will further investigate the performance gap between platforms and push the factorization performance to the hardware limit (in particular on newer Intel Xeon Phi processors with on-package high-bandwidth memory [35,36], newer GPUs at the warp level [37,38], CTA level [39], and cache level [40], and other emergent accelerators such as Matrix-2000 [41]).…”
Section: Applying Optimizations
confidence: 99%