Desh

Das, A.; Mueller, Frank; Siegel, Charles; Vishnu, Abhinav

doi:10.1145/3208040.3208051

Cited by 63 publications

(2 citation statements)

References 29 publications


“…The present research primarily focus on highlighting fault sources and developing the corresponding prediction mechanisms [18]. Das et al also propose a machine learning method that uses short-term memory networks to predict node failures with three minutes lead time, 85% recall, and 83% accuracy [1]. Frank et al based on multiple, independently trained neural networks using different lead-up time offsets, combined with simple majority voting where a consensus among neural networks is required to issue a positive (failure) final prediction [8].…”

Section: Related Studies (mentioning)

confidence: 99%

“…In recent years, owing to the increasing demand for high-performance computing (HPC) as well as the scale-up supercomputers and intelligent computing systems, the reliability of large-scale computing systems has been investigated extensively [1–4]. The system operation is complex, and failures occur frequently which are difficult to detect, locate, diagnose, analyze, and debug [1, 5, 6]. The existing system health check monitoring and techniques generally monitor faults through different log sources, such as root cause diagnosis and fault detection.…”

Section: Introduction (mentioning)

confidence: 99%


Application of multivariate time-series model for high performance computing (HPC) fault prediction

Pei, Yuan, Mao et al. 2023. PLoS ONE


Aiming at the high reliability demand of increasingly large and complex supercomputing systems, this paper proposes a multidimensional fusion CBA-net (CNN-BiLSTAM-Attention) fault prediction model that uses HDBSCAN clustering to preprocess and classify the data. The model extracts and learns the spatial and temporal features of the predecessor fault logs, and offers high sensitivity to time-series features and sufficient extraction of local features. In experiments, the RMSE of the model for fault occurrence time prediction is 0.031, and the prediction accuracy for the node location of a fault is 93% on average. The model converges quickly and improves fine-grained, accurate fault prediction for large supercomputers.


FP‐JSC: Job failure prediction on supercomputers through job application sequence correlation

Xian, Yang. 2024. Concurrency and Computation


Summary: Supercomputers are advanced computing systems interconnected through high‐speed communication networks, consisting of independent computational nodes. During the unfolding of the big data era, the potent computational capabilities of these supercomputers play a pivotal role in scientific computing. Despite executing numerous advanced computational science and engineering tasks on supercomputers, many submitted jobs fail due to various factors, resulting in user inefficiencies. These failures not only consume system resources but also reduce the overall efficiency of the system. Previous research often couples job performance features with a single machine learning method for predicting job failure. However, a primary hurdle emerges from the high cost of gathering these features, complicating their real‐world applicability. To address this challenge, our study establishes correlations among job applications through extensive job log analysis. Leveraging correlations, we propose a predictive framework based on job application sequence correlation (called FP‐JSC). This innovative framework employs multiple machine learning models to offer holistic predictions, selecting the most suitable model based on its learning effectiveness. Moreover, the framework optimizes feature collection expenses without adversely affecting job execution. We determine job applications using both job paths and job names, with the former emerging as a novel feature derived from supplementary monitoring data. Empirical results underscore FP‐JSC's effectiveness, accurately identifying over 89% of jobs with 95% specificity and 89% sensitivity, outperforming single prediction methods employed in related works.


An Assessment of ChatGPT on Log Data

Mudgal, Wouhaybi. 2023. Communications in Computer and Information Science

