Towards Scalable Resource Management for Supercomputers

Dai, Yiqin; Dong, Yong; Lü, Kai; Wang, Ruibo; Zhang, Wei; Chen, Juan; Shao, Mingtian; Wang, Zheng

doi:10.1109/sc41404.2022.00029

Cited by 4 publications

(1 citation statement)

References 41 publications

“…Slurm (Simple Linux Utility for Resource Management) is an open‐source resource management and job scheduling system known for its fault tolerance and high scalability, making it a popular choice for both large and small Linux clusters [27]. Many of the world's top‐ranked supercomputers employ Slurm to ensure effective management of resources and jobs [28], preventing interference and enhancing execution efficiency. Within this intricate computing environment, each job's execution details are meticulously documented in every line of the job logs.…”

Section: Application Sequence and Framework Design (mentioning)

confidence: 99%

FP‐JSC: Job failure prediction on supercomputers through job application sequence correlation

Xian,

Yang,

2024

Concurrency and Computation

Summary: Supercomputers are advanced computing systems interconnected through high‐speed communication networks, consisting of independent computational nodes. During the unfolding of the big data era, the potent computational capabilities of these supercomputers play a pivotal role in scientific computing. Despite executing numerous advanced computational science and engineering tasks on supercomputers, many submitted jobs fail due to various factors, resulting in user inefficiencies. These failures not only consume system resources but also reduce the overall efficiency of the system. Previous research often couples job performance features with a single machine learning method for predicting job failure. However, a primary hurdle emerges from the high cost of gathering these features, complicating their real‐world applicability. To address this challenge, our study establishes correlations among job applications through extensive job log analysis. Leveraging correlations, we propose a predictive framework based on job application sequence correlation (called FP‐JSC). This innovative framework employs multiple machine learning models to offer holistic predictions, selecting the most suitable model based on its learning effectiveness. Moreover, the framework optimizes feature collection expenses without adversely affecting job execution. We determine job applications using both job paths and job names, with the former emerging as a novel feature derived from supplementary monitoring data. Empirical results underscore FP‐JSC's effectiveness, accurately identifying over 89% of jobs with 95% specificity and 89% sensitivity, outperforming single prediction methods employed in related works.
