2014 IEEE International Conference on Cluster Computing (CLUSTER) 2014
DOI: 10.1109/cluster.2014.6968768

Digging deeper into cluster system logs for failure prediction and root cause diagnosis

Abstract: As the sizes of supercomputers and data centers grow towards exascale, failures become normal. System logs play a critical role in the increasingly complex tasks of automatic failure prediction and diagnosis. Many methods for failure prediction are based on analyzing event logs for large scale systems, but there is still neither a widely used one to predict failures based on both non-fatal and fatal events, nor a precise one that uses fine-grained information (such as failure type, node location, related appli…

Cited by 39 publications
(18 citation statements)
References 34 publications
“… Wang et al (2017) apply random forests in event logs to predict maintenance of equipment (in their case study, ATMs). Fu et al (2014b) use system logs (from clusters) to generate causal dependency graphs and predict failures. Russo, Succi & Pedrycz (2015) mine system logs (more specifically, sequences of logs) to predict the system’s reliability by means of linear radial basis functions, and multi-layer perceptron learners.…”
Section: Results; type: mentioning
confidence: 99%
“…There has been extensive research in the detection of anomalies or outliers in logs using both machine learning approaches and using relations across multivariate time-series data in several application domains [14, 17, 18, 24-31]. In this section, we review a set of representative examples of outlier detection applied to log analysis, and highlight a key focus of the contributions of our paper in the context of these rich body of prior art.…”
Section: Related Work; type: mentioning
confidence: 99%
“…With division to subsystems, failure data is limited to the affected subsystem only. In Reference [8], a systematic methodology for reconstructing event order and establishing correlations among events, which indicate the root causes of a given failure from very large system logs is presented. A diagnostics tool was developed to extract the log entries as structured message templates and uses statistical correlation analysis to establish probable cause and effect relationships for the fault being analyzed.…”
Section: Event Management; type: mentioning
confidence: 99%
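
The last citation statement describes extracting log entries as structured message templates and then correlating events to find probable cause and effect relationships. The following is a minimal sketch of that general idea, not the paper's method: it assumes that numbers and hex values are the only variable fields, and it substitutes plain time-window co-occurrence counting for a real statistical correlation test. The names `to_template` and `cooccurrence`, the masking rule, and the sample log lines are all hypothetical.

```python
import re
from collections import Counter

# Assumption: hex values and decimal numbers are the variable parts of a message.
_VARIABLE = re.compile(r"0x[0-9a-fA-F]+|\d+")


def to_template(message: str) -> str:
    """Replace variable tokens with a placeholder to obtain a message template."""
    return _VARIABLE.sub("<*>", message)


def cooccurrence(events, window):
    """Count ordered template pairs that occur within `window` seconds.

    `events` is a list of (timestamp_seconds, message) sorted by timestamp.
    Each pair is stored as (earlier_template, later_template), so a high
    count is only a hint of a possible cause-and-effect ordering.
    """
    templated = [(t, to_template(m)) for t, m in events]
    pairs = Counter()
    for i, (t_i, tpl_i) in enumerate(templated):
        for t_j, tpl_j in templated[i + 1:]:
            if t_j - t_i > window:
                break  # events are sorted, so later ones are even further away
            if tpl_i != tpl_j:
                pairs[(tpl_i, tpl_j)] += 1
    return pairs


# Made-up sample events for illustration only.
events = [
    (0, "ECC error on node 12"),
    (5, "job 881 killed on node 12"),
    (100, "ECC error on node 7"),
    (103, "job 902 killed on node 7"),
    (500, "fan speed 3000 rpm"),
]
pairs = cooccurrence(events, window=10)
```

In this toy input, the two ECC messages collapse to one template and the two job messages to another, so the pair is counted twice. A real diagnosis tool would need a proper template miner and a significance test before treating such a pair as a candidate root cause.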