Proceedings of the 2018 International Symposium on Code Generation and Optimization (CGO 2018)
DOI: 10.1145/3179541.3168831
CUDAAdvisor: LLVM-based runtime profiling for modern GPUs

Abstract: General-purpose GPUs have been widely utilized to accelerate parallel applications. Given a relatively complex programming model and fast architecture evolution, producing efficient GPU code is nontrivial. A variety of simulation and profiling tools have been developed to aid GPU application optimization and architecture design. However, existing tools either provide insufficient insight or lack support across different GPU architectures, runtime and driver versions. This paper presents CUDAAdvisor…
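To make the idea of instrumentation-based runtime profiling concrete, the hand-written CUDA sketch below shows the kind of per-access hook that an LLVM-level instrumentation pass conceptually inserts before each global load and store. The hook, counter names, and toy SAXPY kernel are all hypothetical illustrations; a tool such as CUDAAdvisor injects comparable probes automatically during LLVM code generation rather than requiring source changes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Global counters updated by the instrumentation hook. In a real
// LLVM-based tool, updates like these are injected into the compiler's
// intermediate representation automatically, not written by hand.
__device__ unsigned long long g_loads  = 0;
__device__ unsigned long long g_stores = 0;

// Hypothetical per-access hook: records one global memory operation.
__device__ void record_access(const void *addr, bool is_store) {
    (void)addr;  // a real profiler would also log the address for reuse/locality analysis
    atomicAdd(is_store ? &g_stores : &g_loads, 1ULL);
}

// Toy SAXPY kernel with hand-inserted hooks before each global access.
__global__ void saxpy_instrumented(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        record_access(&x[i], false);   // load x[i]
        record_access(&y[i], false);   // load y[i]
        float v = a * x[i] + y[i];
        record_access(&y[i], true);    // store y[i]
        y[i] = v;
    }
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    saxpy_instrumented<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaDeviceSynchronize();

    unsigned long long loads = 0, stores = 0;
    cudaMemcpyFromSymbol(&loads, g_loads, sizeof(loads));
    cudaMemcpyFromSymbol(&stores, g_stores, sizeof(stores));
    std::printf("global loads: %llu, stores: %llu\n", loads, stores);

    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

In a real profiler the hook would record full address traces rather than bare counts, feeding analyses such as the cache-bypassing decisions mentioned in the citing papers below.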

Cited by 5 publications (6 citation statements)
References 38 publications
“…This is especially desired when porting traditional CPU-based HPC applications onto the new GPU-based exascale systems, such as Summit [6], Sierra [7] and Perlmutter [37]. As part of the community effort, we are planning to pursue these research directions in our future work with our past experience on GPU analytic modeling [38], [39], [40] and performance optimization [41], [42], [43], [44], [45], [46], [47], [48], [49], [50], [51].…”
Section: Discussion
confidence: 99%
“…Compared to multithreaded shared-memory programs on CPUs, it is relatively complex to write efficient CUDA programs and utilize the GPU memory hierarchy. Several performance profiling tools help optimize CUDA programs [9,47,60,62], but these techniques do not help with concurrency correctness.…”
Section: Race Detection and Program Analyses on GPUs
confidence: 99%
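The point about the GPU memory hierarchy is easy to illustrate. The CUDA sketch below is a standard, hypothetical example (not taken from any of the cited papers): a naive matrix transpose whose strided global stores are uncoalesced, next to a shared-memory tiled version in which both loads and stores stay coalesced. Kernel names and the tile size are illustrative.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 32

// Naive transpose: global loads are coalesced, but the stores to `out`
// are strided by n, so each warp scatters its writes across 32 cache lines.
__global__ void transpose_naive(const float *in, float *out, int n) {
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        out[x * n + y] = in[y * n + x];
}

// Tiled transpose: stage the tile in shared memory so that both the global
// load and the global store are coalesced; the +1 padding avoids
// shared-memory bank conflicts on the transposed read.
__global__ void transpose_tiled(const float *in, float *out, int n) {
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];
    __syncthreads();

    // Swap the block indices so the write is contiguous along threadIdx.x.
    x = blockIdx.y * TILE + threadIdx.x;
    y = blockIdx.x * TILE + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];
}

int main() {
    const int n = 1024;
    float *in, *out;
    cudaMalloc(&in,  n * n * sizeof(float));
    cudaMalloc(&out, n * n * sizeof(float));
    cudaMemset(in, 0, n * n * sizeof(float));

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    transpose_naive<<<grid, block>>>(in, out, n);
    transpose_tiled<<<grid, block>>>(in, out, n);
    cudaDeviceSynchronize();

    cudaFree(in);
    cudaFree(out);
    return 0;
}
```

Spotting that the naive kernel's stores are the bottleneck is exactly the kind of insight instrumentation-based profilers aim to surface automatically.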
“…Yeh et al [8] instrument GPU code as it is generated by LLVM to identify redundant instructions. CUDAAdvisor [32] also instruments code as it is generated by LLVM to monitor GPU memory access and decide if bypassing could be used. GVProf [4] instruments GPU binaries to detect both temporal and spatial redundant value patterns.…”
Section: Related Work
confidence: 99%
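As a rough, hypothetical illustration of the value-redundancy analyses mentioned above, the host-side sketch below (ordinary C++ as it might sit beside instrumented CUDA code) flags a load as temporally redundant when it returns the same value as the previous load of the same address. The trace format and function names are invented for this example and only loosely mirror what tools such as GVProf actually do.

```cuda
#include <cstdint>
#include <cstdio>
#include <unordered_map>
#include <vector>

// One record that an (assumed) instrumentation hook could emit for every
// global load: which address was read and which value came back.
struct LoadRecord {
    std::uintptr_t addr;
    std::uint32_t  value;
};

// Count temporally redundant loads: a load is redundant if it returns the
// same value that the previous load of the same address returned. Real tools
// track far more context (PC, warp, spatial neighbours, and so on).
static std::size_t count_redundant_loads(const std::vector<LoadRecord> &trace) {
    std::unordered_map<std::uintptr_t, std::uint32_t> last_value;
    std::size_t redundant = 0;
    for (const LoadRecord &r : trace) {
        auto it = last_value.find(r.addr);
        if (it != last_value.end() && it->second == r.value)
            ++redundant;
        last_value[r.addr] = r.value;
    }
    return redundant;
}

int main() {
    // Tiny synthetic trace: address 0x100 is read twice and returns the same
    // value both times, so exactly one load is flagged as redundant.
    std::vector<LoadRecord> trace = {
        {0x100, 42}, {0x104, 7}, {0x100, 42}, {0x104, 9},
    };
    std::printf("redundant loads: %zu\n", count_redundant_loads(trace));
    return 0;
}
```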
“…Prior tools on GPUs [4, 8, 32] provide fine-grained suggestions using instrumentation-based methods to quantify the severity of performance problems and locate problematic code. These tools identify one or a few patterns, such as redundant value/address, insufficient cache utilization, or memory transaction burst, but overlook others.…”
Section: Introduction
confidence: 99%