2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
DOI: 10.1109/ispass.2019.00041
DeLTA: GPU Performance Model for Deep Learning Applications with In-Depth Memory System Traffic Analysis

Abstract: Training convolutional neural networks (CNNs) requires intense compute throughput and high memory bandwidth. In particular, convolution layers account for the majority of CNN training execution time, and GPUs are commonly used to accelerate these workloads. Optimizing GPU designs for efficient CNN training requires accurately modeling how performance improves as compute and memory resources scale. We present DeLTA, the first analytical model that accurately estimates …

Cited by 34 publications (17 citation statements)
References 29 publications
“…An example sentence for the computational platform description would be: “We use the PyTorch framework version XX, and NVIDIA YY GPUs with CUDA Toolkit ZZ (and nnU‐Net structure) for all experiments. Experiments were performed with 16‐bit precision and microbatching of size 10 to reduce memory usage.” 9–13 …”
Section: Methodsmentioning
confidence: 99%
“…Experiments were performed with 16-bit precision and microbatching of size 10 to reduce memory usage.” 9–13 …”
Section: Computation Platform and Computation Time/complexitymentioning
confidence: 99%
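Microbatching, as described in the quoted sentence, splits a large batch into smaller chunks whose activations fit in GPU memory, accumulating gradients across chunks so the resulting update matches the full-batch gradient. A framework-agnostic sketch of the idea; the toy per-sample gradient and all names below are illustrative, not from the cited work:

```python
def grad(sample, w):
    # Toy per-sample gradient for a 1-D least-squares loss on (x, y) pairs:
    # d/dw (w*x - y)^2 = 2 * x * (w*x - y)
    x, y = sample
    return 2 * x * (w * x - y)

def batch_grad(batch, w, microbatch_size):
    """Average gradient over `batch`, computed `microbatch_size` samples
    at a time so only one microbatch needs to be resident at once."""
    total = 0.0
    for i in range(0, len(batch), microbatch_size):
        micro = batch[i:i + microbatch_size]
        total += sum(grad(s, w) for s in micro)  # accumulate, don't update yet
    return total / len(batch)
```

The accumulated result equals the full-batch average gradient (up to floating-point summation order), which is why microbatching trades memory footprint for extra kernel launches without changing the training trajectory.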
“…Using either the roofline model or a hand-designed heuristic performance model for these kernels turns out to be infeasible, not only because their source code is unavailable, but also because of the tile-quantization and wave-quantization effects of cuBLAS [13]. In existing research on heuristic performance models for proprietary libraries like cuDNN (e.g., Lym et al. [14]), many parameters remain opaque or extremely difficult to measure. Therefore, an ML-based performance model is more suitable in this case than a heuristic one.…”
Section: A Gpu Operator and Kernel Performance Modelsmentioning
confidence: 99%
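The roofline model that this citation statement weighs against heuristic and ML-based alternatives reduces to a single bound: attainable throughput is the lesser of peak compute and peak memory bandwidth times arithmetic intensity. A minimal sketch; the hardware figures are illustrative placeholders (roughly V100-class), not measurements from the cited work:

```python
def roofline_flops(peak_flops, peak_bw_bytes, arithmetic_intensity):
    """Attainable FLOP/s under the roofline model.

    arithmetic_intensity is FLOPs per byte moved to/from DRAM; kernels
    below the ridge point are memory-bound, above it compute-bound.
    """
    return min(peak_flops, peak_bw_bytes * arithmetic_intensity)

# Illustrative numbers for the sketch only:
peak = 14e12        # 14 TFLOP/s FP32
bw = 900e9          # 900 GB/s HBM2
ridge = peak / bw   # intensity at which a kernel becomes compute-bound

low_ai = roofline_flops(peak, bw, 2.0)     # memory-bound: 1.8 TFLOP/s
high_ai = roofline_flops(peak, bw, 100.0)  # compute-bound: capped at 14 TFLOP/s
```

The statement's point is that this bound is too coarse for closed-source kernels: tile and wave quantization make delivered throughput fall well below the roofline in ways the model cannot express.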
“…However, their superior performance has come at the cost of high computational and memory requirements [6], [7]. While convolutional neural networks (CNNs) on general-purpose high-performance compute platforms such as GPUs are now ubiquitous [8], there has been increasing interest in domain-specific hardware accelerators [9] and alternative types of neural networks. In particular, Spiking Neural Network (SNN) accelerators have emerged as a potential low-power alternative for AGI [10]- [13].…”
Section: Introductionmentioning
confidence: 99%