Training convolutional neural networks (CNNs) requires intense compute throughput and high memory bandwidth. In particular, convolution layers account for the majority of CNN training time, and GPUs are commonly used to accelerate these workloads. Optimizing GPU designs for efficient CNN training acceleration requires accurate modeling of how performance improves as compute and memory resources are scaled. We present DeLTA, the first analytical model that accurately estimates the traffic at each level of the GPU memory hierarchy while accounting for the complex data-reuse patterns of a parallel convolution algorithm. We demonstrate that our model is both accurate and robust across different CNNs and GPU architectures. We then show how the model can be used to carefully balance the scaling of different GPU resources for efficient CNN performance improvement.

Index Terms-GPU, memory system, deep learning, CNN

• We introduce DeLTA, a GPU performance model for CNNs. Unlike prior work, DeLTA accurately models traffic across all memory hierarchy levels, capturing the data reuse at each level; accurately modeling memory traffic is critical for future GPU designs, where compute throughput and memory bandwidth must be balanced.
• We are the first to analyze and model the memory access pattern of the im2col convolution algorithm, which is the most-