Bounding the effect of partition camping in GPU kernels

Aji, Ashwin M.; Daga, Mayank; Feng, Wu-chun

doi:10.1145/2016604.2016637

Cited by 16 publications

(4 citation statements)

References 15 publications


“…Similarly, the second kernel used 64 × 4 as the thread block size instead of the original 256 × 1 size. We chose 64 instead of 32 because the thread block size 32 × 4 causes partition camping in this kernel, which can degrade the kernel performance by as much as sevenfold.…”

Section: Results (mentioning)

confidence: 99%

A parallel scheme for accelerating parameter sweep applications on a GPU

Ino, Shigeoka, Okuyama, et al. 2013

Concurrency and Computation


SUMMARY: This paper proposes a parallel scheme for accelerating parameter sweep applications on a graphics processing unit. By using hundreds of cores on the graphics processing unit, we found that our scheme simultaneously processes multiple parameters rather than a single parameter. The simultaneous sweeps exploit the similarity of computing behaviors shared by different parameters, thus allowing memory accesses to be coalesced into a single access if similar irregularities appear among the parameters’ computational tasks. In addition, our scheme reduces the amount of off‐chip memory access by unifying the data that are commonly referenced by multiple parameters and by placing the unified data in the fast on‐chip memory. In several experiments, we applied our scheme to practical applications and found that our scheme can perform up to 8.5 times faster than a naive scheme that processes a single parameter at a time. We also include a discussion on application characteristics that are required for our scheme to outperform the naive scheme. Copyright © 2013 John Wiley & Sons, Ltd.
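The partition camping mentioned in the citing excerpt for this entry can be illustrated with a toy address-mapping model. The sketch below is not taken from either paper: the partition width, the partition count, and the function names `partition_of` and `partition_histogram` are assumptions chosen for illustration. It only shows why some access strides send every concurrently active thread block to the same memory partition while others spread the load evenly.

```python
from collections import Counter

# Assumed memory layout for illustration only; real values vary by GPU.
PARTITION_WIDTH = 256  # bytes served by one partition before moving to the next
NUM_PARTITIONS = 8     # number of global-memory partitions


def partition_of(addr):
    """Map a byte address to a partition using plain round-robin interleaving."""
    return (addr // PARTITION_WIDTH) % NUM_PARTITIONS


def partition_histogram(stride_bytes, num_blocks):
    """Count how many active blocks land on each partition when
    block b starts reading at byte offset b * stride_bytes."""
    return Counter(partition_of(b * stride_bytes) for b in range(num_blocks))


# A stride equal to one full sweep over all partitions: every block camps on partition 0.
camped = partition_histogram(PARTITION_WIDTH * NUM_PARTITIONS, 32)

# A stride of one partition width: blocks are spread evenly over all partitions.
spread = partition_histogram(PARTITION_WIDTH, 32)
```

Under these assumptions, `camped` places all 32 blocks on a single partition, whereas `spread` places 4 blocks on each of the 8 partitions. Changing the thread block shape alters the effective stride between concurrently active blocks, which is one plausible reading of why a 64 × 4 block can avoid the camping that a 32 × 4 block triggers.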


“…Further, we model CPU-GPU memory copy engine which we found is an important factor on L2 accesses and hit rate, since all DRAM accesses go through the L2, including CPU-GPU memory copies [16]. In order to reduce uneven accesses across memory partitions [17], we add an advanced partition indexing that xors the L2 channel bits with randomly selected bits from the higher row and lower bank bits [18]. In the memory system, we accurately model HBM.…”

Section: Methods (mentioning)

confidence: 99%

Exploring Modern GPU Memory System Design Challenges through Accurate Modeling

Khairy, Akshay, Aamodt, et al. 2018

Preprint


This paper explores the impact of simulator accuracy on architecture design decisions in the general-purpose graphics processing unit (GPGPU) space. We perform a detailed, quantitative analysis of the most popular publicly available GPU simulator, GPGPU-Sim, against our enhanced version of the simulator, updated to model the memory system of modern GPUs in more detail. Our enhanced GPU model is able to describe the NVIDIA Volta architecture in sufficient detail to reduce error in memory system event counters by as much as 66×. The reduced error in the memory system further reduces execution time error versus real hardware by 2.5×. To demonstrate the accuracy of our enhanced model against a real machine, we perform a counter-by-counter validation against an NVIDIA TITAN V Volta GPU, demonstrating the relative accuracy of the new simulator versus the publicly available model. We go on to demonstrate that the simpler model discounts the importance of advanced memory system designs such as out-of-order memory access scheduling, while overstating the impact of more heavily researched areas like L1 cache bypassing. Our results demonstrate that it is important for the academic community to enhance the level of detail in architecture simulators as system complexity continues to grow. As part of this detailed correlation and modeling effort, we developed a new Correlator toolset that includes a consolidation of applications from a variety of popular GPGPU benchmark suites, designed to run in reasonable simulation times. The Correlator also includes a database of hardware profiling results for all these applications on NVIDIA cards ranging from Fermi to Volta and a toolchain that enables users to gather correlation statistics and create detailed counter-by-counter hardware correlation plots with minimal effort.
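The XOR-based partition indexing described in the citing excerpt for this entry can be sketched as follows. This is a minimal illustration, not the simulator's actual hash: the bit positions in `hash_shifts`, the field widths, and the function names are all hypothetical, and the excerpt states that the real scheme draws on randomly selected row and bank bits rather than the fixed positions used here.

```python
from collections import Counter


def plain_partition_index(addr, channel_bits=3, line_bits=8):
    """Baseline: take the channel bits directly from the address."""
    mask = (1 << channel_bits) - 1
    return (addr >> line_bits) & mask


def xor_partition_index(addr, channel_bits=3, line_bits=8, hash_shifts=(11, 14, 17)):
    """XOR the channel bits with higher-order address bit fields.

    hash_shifts is a hypothetical, fixed choice of bit positions
    standing in for the randomly selected row and bank bits.
    """
    mask = (1 << channel_bits) - 1
    index = (addr >> line_bits) & mask
    for shift in hash_shifts:
        index ^= (addr >> shift) & mask
    return index


# 32 requests with a 2048-byte stride, which all share the same channel bits.
addresses = [b * 2048 for b in range(32)]
plain = Counter(plain_partition_index(a) for a in addresses)
hashed = Counter(xor_partition_index(a) for a in addresses)
```

With these assumed parameters, the plain mapping sends all 32 requests to one partition, while the XOR mapping distributes them evenly across 8 partitions. Folding higher address bits into the channel selection is what breaks the regular stride pattern that would otherwise produce uneven accesses across partitions.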


“…Performance Analysis and Tuning: Researchers have proposed several techniques to analyze GPU performance from various aspects, including branching, degree of coalescing, race conditions, bank conflict, and partition camping [2], [5], [18]. They provide helpful information for the user to identify potential bottlenecks.…”

Section: Introduction (mentioning)

confidence: 99%

Online Performance Projection for Clusters with Heterogeneous GPUs

Panwar, Aji, Meng, et al. 2013

2013 International Conference on Parallel and Distributed Systems

Self Cite


Abstract: We present a fully automated approach to project the relative performance of an OpenCL program over different GPUs. Performance projections can be made within a small amount of time, and the projection overhead stays relatively constant with the input data size. As a result, the technique can help runtime tools make dynamic decisions about which GPU would run faster for a given kernel. Usage cases of this technique include scheduling or migrating GPU workloads over a heterogeneous cluster with different types of GPUs.
