2020
DOI: 10.1145/3431731
A Simple Model for Portable and Fast Prediction of Execution Time and Power Consumption of GPU Kernels

Abstract: Characterizing compute kernel execution behavior on GPUs for efficient task scheduling is a non-trivial task. We address this with a simple model enabling portable and fast predictions among different GPUs using only hardware-independent features. This model is built based on random forests using 189 individual compute kernels from benchmarks such as Parboil, Rodinia, Polybench-GPU, and SHOC. Evaluation of the model performance using cross-validation yields a median Mean Average Percentage Error (MAPE) of 8.86…
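The methodology the abstract describes (a random forest trained on hardware-independent kernel features, evaluated by cross-validated MAPE) can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the feature set and the synthetic data are assumptions, and only the overall pipeline (random forest + cross-validation + MAPE) follows the abstract.

```python
# Illustrative sketch of the paper's evaluation pipeline (not the authors'
# code): a random-forest regressor on hardware-independent kernel features,
# scored by cross-validated Mean Average Percentage Error (MAPE).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n_kernels = 189  # number of compute kernels used in the paper

# Hypothetical hardware-independent features (e.g. instruction-mix
# fractions); the real feature set is defined in the paper.
X = rng.uniform(0.0, 1.0, size=(n_kernels, 4))
# Synthetic "execution time" target with a mild nonlinearity and noise.
y = 1.0 + 3.0 * X[:, 0] + 2.0 * X[:, 1] * X[:, 2] \
    + 0.1 * rng.normal(size=n_kernels)

# Out-of-fold predictions via 5-fold cross-validation.
model = RandomForestRegressor(n_estimators=100, random_state=0)
y_pred = cross_val_predict(model, X, y, cv=5)

# MAPE over all held-out predictions.
mape = 100.0 * np.mean(np.abs((y - y_pred) / y))
print(f"cross-validated MAPE: {mape:.2f}%")
```

Because the target here is synthetic, the printed MAPE is not comparable to the paper's 8.86% figure; the sketch only shows the shape of the evaluation.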

Cited by 21 publications (12 citation statements) · References 41 publications
“…Dublish et al [13] and Ardalani et al [5] used regression to predict the performance of GPU architecture. Braun et al [8] and O'Neal et al [40] proposed a model based on random forests. Wu et al [53] exploited neural networks, and Guerreiro et al [21] developed a recurrent neural network-based model, which takes as input the sequence of PTX instructions.…”
Section: ML-based Performance Evaluation (mentioning)
confidence: 99%
“…Approximation techniques have also been used over the years to accelerate the simulation of individual components [11], [12]. E.g., by using simple core models (also known as 1-IPC core models) such as those implemented in Sniper [13] and CMP$im [14] if the interest is placed on evaluating the cache hierarchy or the memory system.…”
Section: B. Overview of Simulation Techniques (mentioning)
confidence: 99%
“…Baldini et al [5] use existing OpenMP applications and supervised learning to predict the potential GPU execution speedup among different vendors. Brown et al [8] present a model that yields accurate speedup predictions from a small set of features, while also being portable across Nvidia GPUs with different capabilities. Adams et al [1] propose a novel scheduling algorithm for the Halide programming language that targets image processing pipelines.…”
Section: Related Work (mentioning)
confidence: 99%