MGPUSim

Sun, Yifan; Baruah, Trinayan; Mojumder, Saiful A.; Dong, Shi; Gong, Xiang; Treadway, Shane; Bao, Yuhui; Hance, Spencer; McCardwell, Carter; Zhao, Vincent; Barclay, Harrison; Ziabari, Amir Kavyan; Chen, Zhongliang; Ubal, Rafael; Abellán, José Luis; Kim, John; Joshi, Ajay; Kaeli, David

doi:10.1145/3307650.3322230

Cited by 61 publications

(5 citation statements)

References 40 publications


“…NVArchSim (NVAS) [44] is the proprietary hybrid trace-driven simulator used by Nvidia in which different levels of abstraction (detailed versus high-abstraction timing models) are deployed to balance simulation speed and accuracy. MGPUSim [41] is a parallel simulator for modeling multi-GPU systems.…”

Section: Related Work (mentioning)

confidence: 99%

“…Sampling, through which a limited number of representative regions are simulated, is a widely used methodology. While there exists a large body of work on sampled simulation for CPUs [16]-[20], [24], [38], [39], [51], sampling techniques specifically developed and tailored for speeding up GPU simulation have only recently received attention, see in particular [23], [25], [41], [44]. The state-of-the-art GPU workload sampling methodology, and most closely related work compared to ours, is Principal Kernel Selection (PKS) [11] which was shown to yield high accuracy and high speed for a variety of GPU-compute workloads.…”

Section: Introduction (mentioning)

confidence: 99%


Sieve: Stratified GPU-Compute Workload Sampling

Naderan-Tahan

SeyyedAghaei

Eeckhout

2023

2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)


To exploit the ever increasing compute capabilities offered by GPU hardware, GPU-compute workloads have evolved from simple computational kernels to large-scale programs with complex software stacks and numerous kernels. Driving architecture exploration using real workloads hence becomes increasingly challenging, up to the point of becoming intractable because of extremely long simulation times using existing architecture simulators. Sampling is a widely used technique to speed up simulation, however, the state-of-the-art sampling method for GPU-compute workloads, Principal Kernel Selection (PKS), falls short for challenging GPU-compute workloads with a large number of kernels and kernel invocations. This paper presents Sieve, an accurate and low-overhead stratified sampling methodology for GPU-compute workloads that groups kernel invocations based on their instruction count, with the goal of minimizing the execution time variability within strata. For the challenging Cactus and MLPerf workloads, we report that Sieve achieves an average prediction error of 1.2% (and at most 3.2%) versus 16.5% (and up to 60.4%) for PKS on real hardware (Nvidia Ampere GPU), while maintaining a similar simulation speedup of three orders of magnitude. We further demonstrate that Sieve reduces profiling time by a factor of 8× (and up to 98×) compared to PKS.
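The abstract above describes grouping kernel invocations into strata by instruction count and simulating only representatives. The following is a minimal sketch of that general idea, not the Sieve algorithm itself: the power-of-two bucketing rule, the choice of the first invocation as representative, the record fields `kernel` and `insts`, and the `simulate` callback are all assumptions made for illustration.

```python
from collections import defaultdict
from math import floor, log2


def stratify(invocations):
    """Group invocation records into strata.

    Assumed rule (not from the paper): a stratum is identified by the kernel
    name plus the power-of-two bucket of its instruction count.
    Each record is a dict with hypothetical keys "kernel" and "insts" (> 0).
    """
    strata = defaultdict(list)
    for inv in invocations:
        bucket = floor(log2(inv["insts"]))
        strata[(inv["kernel"], bucket)].append(inv)
    return dict(strata)


def estimate_total_time(strata, simulate):
    """Estimate total execution time from one representative per stratum.

    `simulate` is a hypothetical callback returning the execution time of a
    single invocation. The representative's time per instruction is scaled
    by the total instruction count of its stratum.
    """
    total = 0.0
    for members in strata.values():
        representative = members[0]  # assumed: first invocation in the stratum
        time_per_inst = simulate(representative) / representative["insts"]
        total += time_per_inst * sum(m["insts"] for m in members)
    return total


# Toy data: three invocations of kernel "a" and one of kernel "b".
invocations = [
    {"kernel": "a", "insts": 1000},
    {"kernel": "a", "insts": 1010},
    {"kernel": "a", "insts": 5000},
    {"kernel": "b", "insts": 1000},
]
strata = stratify(invocations)
# Stand-in for a simulator: two time units per instruction.
estimate = estimate_total_time(strata, lambda inv: inv["insts"] * 2.0)
```

In this toy run only one invocation per stratum is passed to `simulate`, which is where the speedup from sampling would come from; the accuracy of the estimate depends on how uniform the execution time is within each stratum.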


“…The AMD GCN architecture [6] is related to OpenCL platform model. A GPU device consists of several compute units.…”

Section: The AMD GCN Architecture (mentioning)

confidence: 99%

A method for decompilation of AMD GCN kernels to OpenCL

Mihajlenko

Lukin

Stankevich

2021

ICS


Introduction: Decompilers are useful tools for software analysis and support in the absence of source code. They are available for many hardware architectures and programming languages. However, none of the existing decompilers support modern AMD GPU architectures such as AMD GCN and RDNA. Purpose: We aim at developing the first assembly decompiler tool for a modern AMD GPU architecture that generates code in the OpenCL language, which is widely used for programming GPGPUs. Results: We developed the algorithms for the following operations: preprocessing assembly code, searching data accesses, extracting system values, decompiling arithmetic operations and recovering data types. We also developed templates for decompilation of branching operations. Practical relevance: We implemented the presented algorithms in Python as a tool called OpenCLDecompiler, which supports a large subset of AMD GCN instructions. This tool automatically converts disassembled GPGPU code into the equivalent OpenCL code, which reduces the effort required to analyze assembly code.


“…We characterize a BFS application from the SHOC Benchmark Suite [11] on a real-world graph and collect the memory trace with a GPU simulator [58]. This BFS application uses CSR graph format and warp-centric execution, similar to Grus.…”

Section: Adaptive UM Policy (mentioning)

confidence: 99%

Grus

Wang

et al. 2021

ACM Trans. Archit. Code Optim.


Today’s GPU graph processing frameworks face scalability and efficiency issues as the graph size exceeds GPU-dedicated memory limit. Although recent GPUs can over-subscribe memory with Unified Memory (UM), they incur significant overhead when handling graph-structured data. In addition, many popular processing frameworks suffer sub-optimal efficiency due to heavy atomic operations when tracking the active vertices. This article presents Grus, a novel system framework that allows GPU graph processing to stay competitive with the ever-growing graph complexity. Grus improves space efficiency through a UM trimming scheme tailored to the data access behaviors of graph workloads. It also uses a lightweight frontier structure to further reduce atomic operations. With easy-to-use interface that abstracts the above details, Grus shows up to 6.4× average speedup over the state-of-the-art in-memory GPU graph processing framework. It allows one to process large graphs of 5.5 billion edges in seconds with a single GPU.
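The citation statement for Grus above mentions a BFS application that uses the CSR graph format. As a rough illustration of that layout only, the sketch below runs a sequential, level-by-level BFS over CSR arrays. It does not model warp-centric execution, Unified Memory, or anything specific to Grus or SHOC; the names `row_offsets`, `col_indices`, and `bfs_csr` are chosen here for illustration.

```python
def bfs_csr(row_offsets, col_indices, source):
    """Breadth-first search over a graph stored in CSR form.

    row_offsets[v] .. row_offsets[v + 1] delimits the slice of col_indices
    that holds the neighbours of vertex v. Returns the BFS depth of every
    vertex, with -1 for vertices unreachable from `source`.
    """
    num_vertices = len(row_offsets) - 1
    depth = [-1] * num_vertices
    depth[source] = 0
    frontier = [source]
    while frontier:
        next_frontier = []
        for v in frontier:
            for e in range(row_offsets[v], row_offsets[v + 1]):
                neighbour = col_indices[e]
                if depth[neighbour] == -1:  # first visit
                    depth[neighbour] = depth[v] + 1
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return depth


# Toy graph with edges 0->1, 0->2, 1->3; vertex 4 has no edges.
row_offsets = [0, 2, 3, 3, 3, 3]
col_indices = [1, 2, 3]
depths = bfs_csr(row_offsets, col_indices, source=0)
```

The inner loop reads a contiguous slice of `col_indices` per vertex, which is the property that makes CSR a common choice for GPU graph traversal.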
