Performance-Aware Speculative Resource Oversubscription for Large-Scale Clusters

Yang, Renyu; Hu, Chunming; Sun, Xiaoyang; Garraghan, Peter; Wo, Tianyu; Wen, Zhenyu; Peng, Hao; Xu, Jie; Li, Chao

doi:10.1109/tpds.2020.2970013

Cited by 23 publications

(28 citation statements)

References 31 publications


“…Monitoring. Monitoring is the key to application aware optimization [10], [26], [53], [62], [17], [63]. In order to obtain a fine-grained view of the infrastructure, Horus leverages cAdvisor 7 , a container monitoring framework.…”

Section: System Implementation (mentioning)

confidence: 99%


Horus: Interference-Aware and Prediction-Based Scheduling in Deep Learning Systems

Yeung, Borowiec, Yang et al. 2022

IEEE Trans. Parallel Distrib. Syst.

Self Cite


To accelerate the training of Deep Learning (DL) models, clusters of machines equipped with hardware accelerators such as GPUs are leveraged to reduce execution time. State-of-the-art resource managers are needed to increase GPU utilization and maximize throughput. While co-locating DL jobs on the same GPU has been shown to be effective, this can incur interference causing slowdown. In this paper we propose Horus: an interference-aware and prediction-based resource manager for DL systems. Horus proactively predicts GPU utilization of heterogeneous DL jobs extrapolated from the DL model's computation graph features, removing the need for online profiling and isolated reserved GPUs. Through micro-benchmarks and job co-location combinations across heterogeneous GPU hardware, we identify GPU utilization as a general proxy metric to determine good placement decisions, in contrast to current approaches which reserve isolated GPUs to perform online profiling and directly measure GPU utilization for each unique submitted job. Our approach promotes high resource utilization and makespan reduction; via real-world experimentation and large-scale trace driven simulation, we demonstrate that Horus outperforms other DL resource managers by up to 61.5% for GPU resource utilization, 23.7-30.7% for makespan reduction and 68.3% in job wait time reduction.


“…Understanding and achieving high resource utilization for heterogeneous workloads-including DL-in cloud computing is an important topic [30], [28], [21], [22], [14], [62], [8], [6], [17], [18], [10]. GPU profiling.…”

Section: Related Work (mentioning)

confidence: 99%

Horus: Interference-Aware and Prediction-Based Scheduling in Deep Learning Systems

Yeung, Borowiec, Yang et al. 2022

IEEE Trans. Parallel Distrib. Syst.

Self Cite


“…drones have proliferated recently and widely adopted in numerous industrial or commercial areas such as weather observation [1], disaster management [2], agricultural irrigation [3], etc. The advancement of such applications is mainly propelled by diverse deep neural networks models [4], [5], [6], [7] and massive-scale high performance computing [8], [9]. While promising, security and privacy issues become the main concerns in the traffic management for the safe presence of UAVs in the airspace [10], [11].…”

Section: Introduction (mentioning)

confidence: 99%

BisSiam: Bispectrum Siamese Network Based Contrastive Learning for UAV Anomaly Detection

Li, Zhou, Cai et al. 2023

IEEE Trans. Knowl. Data Eng.

Self Cite


In recent years, a surging number of unmanned aerial vehicles (UAVs) have come into widespread use in many areas. However, the increasing number of UAVs may cause privacy and security issues such as voyeurism and espionage. It is critical for individuals or organizations to manage their behaviors and proactively prevent the misbehaved invasion of unauthorized UAVs through effective anomaly detection. The UAV anomaly detection framework needs to cope with complex signals in noise-prone environments and to function with very limited labeled samples. This paper proposes BISSIAM, a novel framework that is capable of identifying UAV presence, types and operation modes. BISSIAM converts UAV signals to bispectrum as the input and exploits a siamese network based contrastive learning model to learn the vector encoding. A sampling mechanism is proposed for optimizing the sample size involved in the model training whilst ensuring the model accuracy without compromising the training efficiency. Finally, we present a similarity-based fingerprint matching mechanism for detecting unseen UAVs without the need to retrain the whole model. Experimental results show that our approach outperforms other baselines and can reach 92.85% accuracy of UAV type detection in unsupervised learning scenarios. 91.4% accuracy can be achieved when BISSIAM is used for detecting the UAV type of out-of-sample UAVs.


“…However, they are not innately designed to consider intercluster (cluster-to-cluster) performance when enacting workload placement and execution decisions. This is problematic as clusters leveraged for cloud computing are exposed to network volatility [3], dynamic utilization [30], and heterogeneous scheduling architectures [7] -all which are intrinsic to federated cluster environments. The majority of federated orchestration systems only consider resource demand and reservation [5], [28], and omit characteristics at network-level (bandwidth, latency), node-level (interference, contention) and cluster-level (scheduler type).…”

Section: Introduction (mentioning)

confidence: 99%

An Empirical Study of Inter-cluster Resource Orchestration within Federated Cloud Clusters

Lindsay, Yeung, Elkhatib et al. 2021

2021 IEEE International Conference on Joint Cloud Computing (JCC)

Self Cite


Federated clusters are composed of multiple independent clusters of machines interconnected by a resource management system, and possess several advantages over centralized cloud datacenter clusters including seamless provisioning of applications across large geographic regions, greater fault tolerance, and increased cluster resource utilization. However, while existing resource management systems for federated clusters are capable of improving application intra-cluster performance, they do not capture inter-cluster performance in their decision making. This is important given federated clusters must execute a wide variety of applications possessing heterogeneous system architectures, which are impacted by unique inter-cluster performance conditions such as network latency and localized cluster resource contention. In this work we present an empirical study demonstrating how inter-cluster performance conditions negatively impact federated cluster orchestration systems. We conduct a series of micro-benchmarks under various cluster operational scenarios showing the critical importance of capturing inter-cluster performance for resource orchestration in federated clusters. From this benchmark, we determine precise limitations in existing federated orchestration, and highlight key insights to design future orchestration systems. Notable findings include that different application types exhibit innate performance affinities across various federated cluster operational conditions, and experience substantial performance degradation from even minor increases to latency (8.7x) and resource contention (12.0x) in comparison to centralized cluster architectures.
