2022
DOI: 10.1007/s10586-022-03805-x
Improving Oversubscribed GPU Memory Performance in the PyTorch Framework

Cited by 2 publications (2 citation statements)
References 19 publications
“…The dataset comprised the training (800 images) and test (200 images) sets. The former was used to train the model using the Pytorch framework [21] on an NVIDIA GeForce GTX 3090Ti GPU for fitting the parameters of the model. The latter was used to evaluate the performance of the model by comparing its results with manually measured results.…”
Section: Model Training (mentioning; confidence: 99%)
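
The statement above describes a standard PyTorch train-then-evaluate workflow. Below is a minimal sketch of that general pattern, assuming placeholder data, a placeholder model, and arbitrary hyperparameters; only the 800/200 split sizes come from the statement, and nothing here reproduces the citing authors' actual setup.

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

device = "cuda" if torch.cuda.is_available() else "cpu"

# Random tensors stand in for the 800 training and 200 test images (placeholder data).
train_ds = TensorDataset(torch.randn(800, 3, 224, 224), torch.randn(800, 1))
test_ds = TensorDataset(torch.randn(200, 3, 224, 224), torch.randn(200, 1))

# Placeholder model; the citing paper's architecture is not reproduced here.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1)).to(device)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# Fit the model parameters on the training set.
model.train()
for xb, yb in DataLoader(train_ds, batch_size=16, shuffle=True):
    xb, yb = xb.to(device), yb.to(device)
    opt.zero_grad()
    loss_fn(model(xb), yb).backward()
    opt.step()

# Evaluate on the held-out test set.
model.eval()
with torch.no_grad():
    losses = [loss_fn(model(xb.to(device)), yb.to(device)).item()
              for xb, yb in DataLoader(test_ds, batch_size=16)]
print(f"mean test loss: {sum(losses) / len(losses):.4f}")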
“…Additionally, UM supports GPU memory oversubscription, i.e., GPU kernels access more data than the GPU memory can hold, significantly enhancing programming portability and productivity for memory-demanding workloads. UM technologies have been adopted by HPC frameworks such as Raja [6], Kokkos [9], and Trilinos [16] for writing portable applications on today's and future's major HPC platforms, and by deep learning frameworks [12,22,34]. However, even with active research and improvement by vendors and research community [3,18,23,42], current UM technologies cause significant, or even prohibitive, performance degradation [25,26,46].…”
Section: Introduction (mentioning; confidence: 99%)
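
The statement above explains the mechanism at issue: with CUDA Unified Memory (UM), kernels can access more data than the GPU physically holds, and the driver migrates pages on demand. Below is a minimal sketch of UM oversubscription from Python using CuPy's managed-memory allocator; this is an illustrative assumption (CuPy as the vehicle, an arbitrary 1.5x oversubscription factor), not the cited paper's modified PyTorch allocator.

import cupy as cp

# Route CuPy device allocations through cudaMallocManaged (Unified Memory),
# so arrays can exceed physical GPU memory and are paged on demand.
cp.cuda.set_allocator(cp.cuda.malloc_managed)

free_bytes, total_bytes = cp.cuda.Device(0).mem_info
print(f"physical GPU memory: {total_bytes / 2**30:.1f} GiB")

# Allocate roughly 1.5x the physical GPU memory (arbitrary oversubscription
# factor); a plain device allocation of this size would simply fail.
n = int(1.5 * total_bytes) // 4          # number of float32 elements
x = cp.zeros(n, dtype=cp.float32)

x += 1.0                                 # the kernel touches more data than the GPU holds
print(float(x.sum()))                    # pages migrate as the reduction streams over x

This oversubscribed regime is exactly where the statement notes that current UM technologies can cause significant, or even prohibitive, performance degradation, which is the problem the cited paper targets in PyTorch.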