The MIT Supercloud Dataset

Samsi, Siddharth; Weiss, Matthew L.; Bestor, David; Li, Baolin; Jones, Michael; Reuther, Albert; Edelman, Daniel; Arcand, William; Byun, Chansup; Holodnack, John; Hubbell, Matthew; Kepner, Jeremy; Klein, Anna; McDonald, Joseph; Michaleas, Adam; Michaleas, Peter; Milechin, Lauren; Mullen, Julia; Yee, Charles; Price, Benjamin W.; Prout, Andrew; Rosa, Antonio; Vanterpool, Allan; McEvoy, Lindsey; Cheng, Anson; Tiwari, Devesh; Gadepally, Vijay

doi:10.1109/hpec49654.2021.9622850

Cited by 18 publications

(10 citation statements)

References 17 publications


“…Given the intensive compute resources required to conduct such scaling studies, we intend to make all experimental data from this study publicly available as part of the MIT Supercloud Datacenter Challenge [35] via this https URL.…”

Section: Discussion (mentioning)

confidence: 99%

“…Traditionally, HPC centers limit GPU usage to prevent users from misusing systems, while cloud providers eagerly allow users to provision as many resources as they can afford. Rarely do scientific DL practitioners examine their resource needs; most workflows are either run on a single GPU due to the lack of engineering infrastructure needed to scale, or are run on the maximum number of available GPUs [18,35]. Efficient training and scaling strategies may be even more important than architecture details in some domains [4,31,43].…”

Section: Introduction (mentioning)

confidence: 99%


Benchmarking Resource Usage for Efficient Distributed Deep Learning

Frey, Li, McDonald et al., 2022. Preprint. Self-citation.

Deep learning (DL) workflows demand an ever-increasing budget of compute and energy in order to achieve outsized gains. Neural architecture searches, hyperparameter sweeps, and rapid prototyping consume immense resources that can prevent resource-constrained researchers from experimenting with large models and carry considerable environmental impact. As such, it becomes essential to understand how different deep neural networks (DNNs) and training leverage increasing compute and energy resources, especially specialized computationally-intensive models across different domains and applications. In this paper, we conduct over 3,400 experiments training an array of deep networks representing various domains/tasks (natural language processing, computer vision, and chemistry) on up to 424 graphics processing units (GPUs). During training, our experiments systematically vary compute resource characteristics and energy-saving mechanisms such as power utilization and GPU clock rate limits to capture and illustrate the different trade-offs and scaling behaviors each representative model exhibits under various resource and energy-constrained regimes. We fit power law models that describe how training time scales with available compute resources and energy constraints. We anticipate that these findings will help inform and guide high-performance computing providers in optimizing resource utilization, by selectively reducing energy consumption for different deep learning tasks/workflows with minimal impact on training.
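The abstract mentions fitting power law models of training time against compute resources. As a minimal sketch of what such a fit can look like, assuming the simple form T ≈ a · N^(-b) for training time T on N GPUs (the abstract does not give the paper's exact functional form), the snippet below does a least-squares fit in log-log space. The function name `fit_power_law` and the data points are hypothetical and synthetic; they are not taken from the paper.

```python
import numpy as np


def fit_power_law(n_gpus, train_time):
    """Fit train_time ≈ a * n_gpus**(-b) by linear least squares in log-log space.

    Hypothetical helper for illustration; not the cited paper's procedure.
    """
    log_n = np.log(np.asarray(n_gpus, dtype=float))
    log_t = np.log(np.asarray(train_time, dtype=float))
    # log T = log a - b * log N, so the slope is -b and the intercept is log a.
    slope, intercept = np.polyfit(log_n, log_t, 1)
    return float(np.exp(intercept)), float(-slope)


# Synthetic, noise-free data generated from a known power law (NOT measurements).
n_gpus = np.array([1, 2, 4, 8, 16, 32], dtype=float)
train_time = 100.0 * n_gpus ** -0.8

a, b = fit_power_law(n_gpus, train_time)
print(f"a = {a:.3f}, b = {b:.3f}")
```

An exponent b close to 1 would correspond to near-linear speedup with added GPUs, while smaller values indicate diminishing returns.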


“…We use various types of deep learning (DL) workloads because the recent advancement in DL algorithms has made them popular in scientific research and production datacenters [41][42][43]. We uniformly sample the DL model and training batch size from Table 2.…”

Section: Methods (mentioning)

confidence: 99%

Using Multi-Instance GPU for Efficient Operation of Multi-Tenant GPU Clusters

Li, Patel, Samsi et al., 2022. Preprint. Self-citation.

GPU technology has been improving at an expedited pace in terms of size and performance, empowering HPC and AI/ML researchers to advance the scientific discovery process. However, this also leads to inefficient resource usage, as most GPU workloads, including complicated AI/ML models, are not able to utilize the GPU resources to their fullest extent. We propose MISO, a technique to exploit the Multi-Instance GPU (MIG) capability of NVIDIA A100 GPUs to dynamically partition GPU resources among co-located jobs. MISO's key insight is to use the lightweight, more flexible Multi-Process Service (MPS) capability to predict the best MIG partition allocation for different jobs, without incurring the overhead of implementing them during exploration. Due to its ability to utilize GPU resources more efficiently, MISO achieves 49% and 16% lower average job completion time than the unpartitioned and optimal static GPU partition schemes, respectively.


“…We use various types of deep learning (DL) workloads because the recent advancement in DL algorithms has made them popular in scientific research and production datacenters [43][44][45]. We uniformly sample the DL model and training batch size from Table 2.…”

Section: Methods (mentioning)

confidence: 99%

MISO

Patel, Samsi et al., 2022. Proceedings of the 13th Symposium on Cloud Computing. Self-citation.

GPU technology has been improving at an expedited pace in terms of size and performance, empowering HPC and AI/ML researchers to advance the scientific discovery process. However, this also leads to inefficient resource usage, as most GPU workloads, including complicated AI/ML models, are not able to utilize the GPU resources to their fullest extent, encouraging support for GPU multi-tenancy. We propose MISO, a technique to exploit the Multi-Instance GPU (MIG) capability on the latest NVIDIA datacenter GPUs (e.g., A100, H100) to dynamically partition GPU resources among co-located jobs. MISO's key insight is to use the lightweight, more flexible Multi-Process Service (MPS) capability to predict the best MIG partition allocation for different jobs, without incurring the overhead of implementing them during exploration. Due to its ability to utilize GPU resources more efficiently, MISO achieves 49% and 16% lower average job completion time than the unpartitioned and optimal static GPU partition schemes, respectively.

CCS Concepts: Computer systems organization → Cloud computing.
