Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing 2018
DOI: 10.1145/3208040.3208051

Desh

Cited by 63 publications (2 citation statements)
References 29 publications
“…The present research primarily focus on highlighting fault sources and developing the corresponding prediction mechanisms [18]. Das et al also propose a machine learning method that uses short-term memory networks to predict node failures with three minutes lead time, 85% recall, and 83% accuracy [1]. Frank et al based on multiple, independently trained neural networks using different lead-up time offsets, combined with simple majority voting where a consensus among neural networks is required to issue a positive (failure) final prediction [8].…”
Section: Related Studies (citation type: mentioning)
confidence: 99%
“…In recent years, owing to the increasing demand for high-performance computing (HPC) as well as the scale-up supercomputers and intelligent computing systems, the reliability of large-scale computing systems has been investigated extensively [1-4]. The system operation is complex, and failures occur frequently which are difficult to detect, locate, diagnose, analyze, and debug [1, 5, 6]. The existing system health check monitoring and techniques generally monitor faults through different log sources, such as root cause diagnosis and fault detection.…”
Section: Introduction (citation type: mentioning)
confidence: 99%