2022
DOI: 10.21203/rs.3.rs-2266264/v1
Preprint

Workload Analysis and Prediction of Multi-type GPU in Heterogeneous GPU Clusters

Abstract: Heterogeneous GPU clusters play an important role in processing parallel applications and massive data sets on cloud platforms. However, due to the diversity of GPU types, effectively allocating the various GPU types is a challenge. This paper first analyzes the request and allocation characteristics of various GPU types based on Alibaba cluster data. We then propose a method to adaptively select the best model for predicting the demand of each GPU type, and to extract features from the best model. Further,…
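The adaptive model-selection idea in the abstract can be illustrated with a minimal sketch: evaluate several candidate forecasters on a held-out tail of the demand history and keep the one with the lowest error. The candidate models, holdout length, and MAE metric below are illustrative assumptions, not the paper's actual method.

```python
import numpy as np

def select_best_model(history, candidates, holdout=24):
    """Pick the candidate whose forecasts minimize MAE on a held-out tail.

    history    -- 1-D array of past demand for one GPU type
    candidates -- dict: name -> function(train, horizon) -> forecast array
    """
    train, test = history[:-holdout], history[-holdout:]
    errors = {}
    for name, model in candidates.items():
        forecast = model(train, holdout)
        errors[name] = float(np.mean(np.abs(forecast - test)))
    best = min(errors, key=errors.get)
    return best, errors

# Two toy forecasters standing in for real demand-prediction models.
def naive_last(train, h):
    return np.full(h, train[-1])

def moving_average(train, h, w=12):
    return np.full(h, train[-w:].mean())

demand = np.abs(np.sin(np.arange(200) / 10)) * 50 + 5  # synthetic demand series
best, errs = select_best_model(demand, {"naive": naive_last,
                                        "moving_avg": moving_average})
print(best, errs)
```

In a real deployment, the candidate set would hold the paper's actual time-series models and the selection would be rerun per GPU type as new demand data arrives.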

Cited by 15 publications (6 citation statements)
References 23 publications
“…We considered a subset of the ImageNet competition dataset [38], including 10 classes with 1300 images each. We trained jobs using ResNet [39], VGG16 [40], AlexNet [41], and MobileNetV2 [42], varying the batch size (16, 32, 64) and the optimizer (Adam, SGD). We set the maximum number of epochs to 100 for all jobs and recorded after how many epochs each job terminates under a patience-based stopping criterion.…”
Section: A. Experimental Setup and Methodology
confidence: 99%
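The patience-based stopping criterion mentioned in the quote can be sketched as follows; the patience value and the synthetic loss sequence are illustrative, not taken from the cited experimental setup.

```python
def epochs_until_stop(val_losses, patience=5, max_epochs=100):
    """Return the epoch at which training stops: when the validation loss
    has not improved for `patience` consecutive epochs, or at max_epochs."""
    best = float("inf")
    since_improvement = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best = loss
            since_improvement = 0
        else:
            since_improvement += 1
        if since_improvement >= patience:
            return epoch
    return min(len(val_losses), max_epochs)

# Loss improves through epoch 10, then plateaus, so training stops
# at epoch 10 + patience.
losses = [1.0 / e for e in range(1, 11)] + [0.1] * 90
print(epochs_until_stop(losses, patience=5))  # → 15
```

This is the same mechanism exposed by common framework callbacks (e.g. a `patience` parameter on an early-stopping hook), recorded here as the per-job termination epoch.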
“…Note that, while deciding to deploy jobs on single nodes may seem a limitation, a recent DL workload analysis [32] highlighted that over 50% of jobs require a single GPU, while jobs using more than 8 GPUs (which we will consider the maximum node size in our experimental evaluation) account for less than 10%. Moreover, enforcing GPU locality yields over 10× speed-up [16].…”
Section: Resource Selection-Job Scheduling Problem
confidence: 99%
“…To compute the SLO attainment with a given set of requests and placement, in AlpaServe, we assume we know the arrival process in advance. Although short-term burstiness is impossible to predict, the arrival pattern over longer timescales (e.g., hours or days) is often predictable [43]. Given this predictability, AlpaServe either directly uses the history request traces or fits a distribution from the trace and resamples new traces from the distribution as the input workload to the simulator to compute the SLO attainment.…”
Section: Placement Algorithm
confidence: 99%
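The fit-and-resample step described in the quote can be sketched roughly as follows. Fitting a Gamma distribution to inter-arrival times via the method of moments is one plausible choice; the quote does not specify which distribution family AlpaServe actually uses.

```python
import numpy as np

def resample_trace(arrival_times, n, rng=None):
    """Fit inter-arrival times with a Gamma distribution (method of
    moments) and sample a new synthetic arrival trace of n requests."""
    rng = np.random.default_rng(rng)
    gaps = np.diff(np.sort(np.asarray(arrival_times, dtype=float)))
    mean, var = gaps.mean(), gaps.var()
    shape = mean**2 / var if var > 0 else 1.0  # method-of-moments fit
    scale = var / mean if var > 0 else mean
    new_gaps = rng.gamma(shape, scale, size=n)
    return np.cumsum(new_gaps)  # synthetic arrival timestamps

# History trace: 500 arrivals with exponential gaps (mean 2.0 time units).
trace = np.cumsum(np.random.default_rng(0).exponential(2.0, size=500))
synthetic = resample_trace(trace, n=1000, rng=1)
print(len(synthetic), synthetic[-1] / 1000)  # mean gap ≈ 2.0
```

A trace resampled this way preserves the longer-timescale rate of the history while generating fresh short-term variation, which is what the simulator needs to estimate SLO attainment.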
“…Furthermore, there is significant and unpredictable burstiness in the arrival process of user requests. To meet tight SLO, contemporary serving systems are forced to over-provision compute resources, resulting in low cluster utilization [43].…”
Section: Introduction
confidence: 99%