2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)
DOI: 10.1109/ispass.2019.00041
DeLTA: GPU Performance Model for Deep Learning Applications with In-Depth Memory System Traffic Analysis

Abstract: Training convolutional neural networks (CNNs) requires intense compute throughput and high memory bandwidth. In particular, convolution layers account for the majority of CNN training execution time, and GPUs are commonly used to accelerate these workloads. Optimizing GPU designs for efficient CNN training requires accurately modeling how performance improves as compute and memory resources scale. We present DeLTA, the first analytical model that accurately estimates …

Cited by 34 publications (17 citation statements)
References 29 publications
“…An example sentence for the computational platform description would be: “We use the PyTorch framework version XX, and NVIDIA YY GPUs with CUDA Toolkit ZZ (and nnU‐Net structure) for all experiments. Experiments were performed with 16‐bit precision and microbatching of size 10 to reduce memory usage.” 9–13 …”
Section: Methodsmentioning
confidence: 99%
“…Experiments were performed with 16-bit precision and microbatching of size 10 to reduce memory usage.” 9–13 …”
Section: Computation Platform and Computation Time/complexitymentioning
confidence: 99%
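Microbatching, as described in the quoted sentence, splits a large batch into smaller chunks whose activations fit in GPU memory, accumulating gradients across chunks so the resulting update matches the full-batch gradient. A framework-agnostic sketch of the idea; the toy per-sample gradient and all names below are illustrative, not from the cited work:

```python
def grad(sample, w):
    # Toy per-sample gradient for a 1-D least-squares loss on (x, y) pairs:
    # d/dw (w*x - y)^2 = 2 * x * (w*x - y)
    x, y = sample
    return 2 * x * (w * x - y)

def batch_grad(batch, w, microbatch_size):
    """Average gradient over `batch`, computed `microbatch_size` samples
    at a time so only one microbatch needs to be resident at once."""
    total = 0.0
    for i in range(0, len(batch), microbatch_size):
        micro = batch[i:i + microbatch_size]
        total += sum(grad(s, w) for s in micro)  # accumulate, don't update yet
    return total / len(batch)
```

The accumulated result equals the full-batch average gradient (up to floating-point summation order), which is why microbatching trades memory footprint for extra kernel launches without changing the training trajectory.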
“…Using either the roofline model or a hand-designed heuristic performance model for these kernels turns out to be infeasible, not only because their source code is unavailable, but also because of the tile-quantization and wave-quantization effects of cuBLAS [13]. In existing research on heuristic performance models for proprietary libraries like cuDNN (e.g., Lym et al. [14]), many parameters remain opaque or extremely difficult to measure. Therefore, an ML-based performance model is more suitable in this case than a heuristic one.…”
Section: A Gpu Operator and Kernel Performance Modelsmentioning
confidence: 99%
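The roofline model that this citation statement weighs against heuristic and ML-based alternatives reduces to a single bound: attainable throughput is the lesser of peak compute and peak memory bandwidth times arithmetic intensity. A minimal sketch; the hardware figures are illustrative placeholders (roughly V100-class), not measurements from the cited work:

```python
def roofline_flops(peak_flops, peak_bw_bytes, arithmetic_intensity):
    """Attainable FLOP/s under the roofline model.

    arithmetic_intensity is FLOPs per byte moved to/from DRAM; kernels
    below the ridge point are memory-bound, above it compute-bound.
    """
    return min(peak_flops, peak_bw_bytes * arithmetic_intensity)

# Illustrative numbers for the sketch only:
peak = 14e12        # 14 TFLOP/s FP32
bw = 900e9          # 900 GB/s HBM2
ridge = peak / bw   # intensity at which a kernel becomes compute-bound

low_ai = roofline_flops(peak, bw, 2.0)     # memory-bound: 1.8 TFLOP/s
high_ai = roofline_flops(peak, bw, 100.0)  # compute-bound: capped at 14 TFLOP/s
```

The statement's point is that this bound is too coarse for closed-source kernels: tile and wave quantization make delivered throughput fall well below the roofline in ways the model cannot express.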
“…However, their superior performance has come at the cost of high computational and memory requirements [6], [7]. While convolutional neural networks (CNNs) on general-purpose high-performance compute platforms such as GPUs are now ubiquitous [8], there has been increasing interest in domain-specific hardware accelerators [9] and alternative types of neural networks. In particular, Spiking Neural Network (SNN) accelerators have emerged as a potential low-power alternative for AGI [10]- [13].…”
Section: Introductionmentioning
confidence: 99%