2016 SAI Computing Conference (SAI) 2016
DOI: 10.1109/sai.2016.7555961

Machine learning based job status prediction in scientific clusters

Abstract: Large high-performance computing systems are built with an increasing number of components, with more CPU cores, more memory, and more storage space. At the same time, scientific applications have been growing in complexity. Together, these trends are leading to more frequent unsuccessful job statuses on HPC systems. From measured job statuses, 23.4% of CPU time was spent on unsuccessful jobs. We set out to study whether these unsuccessful job statuses could be anticipated from known job characteristics. To e…

Cited by 14 publications (9 citation statements)
References 21 publications
“…Papers [2][3][4][5] demonstrate related research, such as job status prediction, failure prediction and anomaly detection, based on log file analysis with machine learning with good results. Whether abnormal detection or job status prediction, the number of correct instances (majority class) should be much more than the number of incorrect instances (minority class) in a dataset, which leads to an imbalanced dataset just like our dataset presented here.…”
Section: Discussion (mentioning)
confidence: 99%
“…Klinkenberg et al [2] proposed and evaluated a method for predicting failures with framed cluster monitoring data and extracted features describing the characteristic of the signals. Authors in [3] presented a machine learning based Random forests (RF) classification model for predicting unsuccessful job executions. In modern supercomputing centers, successful or health jobs occupy a very large part of job databases.…”
Section: Related Work (mentioning)
confidence: 99%
“…To grasp the complex relation among system components and to discover early symptoms of failures to come, several works rely on ML techniques, e.g., [10,[71][72][73][74][75]. Abu-Samah et al [74] rely on Bayesian networks; as a drawback, this approach requires to be complemented with the extraction and validation of system patterns, which may involve expert opinions or elicitations on several levels.…”
Section: Related Work (mentioning)
confidence: 99%