2018 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
DOI: 10.1109/ipdps.2018.00025

CIAO: Cache Interference-Aware Throughput-Oriented Architecture and Scheduling for GPUs

Abstract: A modern GPU aims to simultaneously execute more warps for higher Thread-Level Parallelism (TLP) and performance. When generating many memory requests, however, warps contend for limited cache space and thrash the cache, which in turn severely degrades performance. To reduce such cache thrashing, we may adopt cache locality-aware warp scheduling, which gives higher execution priority to warps with higher potential of data locality. However, we observe that warps with high potential of data locality often incur far…
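To see the baseline idea the abstract describes concretely, below is a minimal sketch of cache locality-aware warp selection. It is an illustrative simplification, not the paper's CIAO architecture: the Warp record, its hit/access counters, and the locality_score heuristic are all hypothetical choices assumed for this example.

```python
from dataclasses import dataclass

@dataclass
class Warp:
    wid: int              # warp id; lower id = older warp
    ready: bool = True    # eligible to issue this cycle
    hits: int = 0         # cache hits observed so far
    accesses: int = 0     # total cache accesses so far

    @property
    def locality_score(self) -> float:
        # Proxy for data-locality potential: observed hit rate.
        return self.hits / self.accesses if self.accesses else 0.0

def pick_next_warp(warps: list[Warp]) -> Warp | None:
    """Locality-aware scheduling: among ready warps, issue the one
    with the highest estimated locality, breaking ties in favor of
    the oldest warp (lowest id)."""
    ready = [w for w in warps if w.ready]
    if not ready:
        return None
    return max(ready, key=lambda w: (w.locality_score, -w.wid))

# A streaming warp (few hits) yields to a warp with strong reuse,
# protecting the latter's working set from cache thrashing.
warps = [Warp(0, hits=1, accesses=10), Warp(1, hits=8, accesses=10)]
assert pick_next_warp(warps).wid == 1
```

Note that the abstract's final, truncated sentence signals a downside to exactly this policy, so the sketch should be read as the baseline the paper critiques rather than its proposed solution.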

Cited by 6 publications (1 citation statement)
References 27 publications
“…Effective utilization of intra- and inter-warp data locality can improve on-chip cache hit rate, mitigate cache interference [6], reduce the number of costly off-chip accesses, and improve GPU performance [7]. The traditional LRR (Loose Round Robin) and GTO (Greedy Then Oldest) scheduling algorithms preserve inter-warp locality and intra-warp locality, respectively.…”
Section: Introduction
Citation type: mentioning, confidence: 99%
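To make the cited contrast between the two baseline schedulers concrete, here is a minimal sketch of both selection policies. It is an illustrative simplification, not any GPU's actual scheduler implementation; the dictionary-based warp records and their "ready", "greedy", and "age" fields are assumptions made for the example.

```python
from collections import deque

def lrr_select(warps: deque) -> int:
    """Loose Round Robin: rotate through warps and issue from the
    next ready one. All warps progress roughly in lockstep, so data
    fetched by one warp is soon reused by its peers (inter-warp
    locality)."""
    for _ in range(len(warps)):
        w = warps[0]
        warps.rotate(-1)       # advance the rotation either way
        if w["ready"]:
            return w["wid"]
    return -1                  # nothing ready this cycle

def gto_select(warps: list) -> int:
    """Greedy Then Oldest: keep issuing from the current warp until
    it stalls, then switch to the oldest ready warp. A single warp's
    working set stays hot in cache (intra-warp locality)."""
    greedy = next((w for w in warps if w["greedy"]), None)
    if greedy is not None and greedy["ready"]:
        return greedy["wid"]
    # Current warp stalled: the oldest ready warp becomes greedy.
    for w in sorted(warps, key=lambda w: w["age"], reverse=True):
        if w["ready"]:
            for other in warps:
                other["greedy"] = False
            w["greedy"] = True
            return w["wid"]
    return -1
```

The design difference is visible in the control flow: LRR always advances the rotation, spreading issue slots evenly, while GTO sticks with one warp until a stall forces a switch, which is why each policy favors a different kind of locality.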