2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP) 2016
DOI: 10.1109/pdp.2016.101

Lessons Learned from Spatial and Temporal Correlation of Node Failures in High Performance Computers

Abstract: In this paper we study the correlation of node failures in time and space. Our study is based on measurements of a production high performance computer over an 8-month time period. We draw possible types of correlations between node failures and show that, in many cases, there are direct correlations between observed node failures. The significance of such a study is twofold: achieving a clearer understanding of correlations between node failures and enabling failure detection as early as possible. Th…

Cited by 13 publications (9 citation statements) · References 10 publications
“…Considering the high number of components in modern HPC systems and the strong correlations between node failures [8], [15], several studies investigate behavioral analysis to predict failures via anomaly detection. Both supervised [20] and unsupervised [21] approaches were proposed.…”
Section: Related Work
“…Anomaly detection is a well-known general purpose approach for detecting failures in computing systems [4]. In HPC systems, system log analysis can be used for anomaly detection for the purpose of preventing failures [5]- [8]. All HPC systems on the current TOP500 [9] list are Linux-based.…”
Section: Introduction
“…Even though this pattern changes according to the workload on the node and other environmental parameters, the pattern mostly remains similar to previous patterns of the same node. Previous studies showed that the majority of neighboring nodes (located in a similar rack or island) also exhibit similar syslog generation patterns [5]. To extract the patterns, a failure detection mechanism needs to assign an event category to each syslog entry.…”
Section: Assessing Data Usefulness
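The quoted passage describes a preprocessing step: before node-level log patterns can be compared, each syslog entry must be assigned an event category. A minimal sketch of such a categorizer is shown below; the category names and regular expressions are purely illustrative assumptions, not taken from the paper, which does not specify its category set here.

```python
import re

# Hypothetical event categories and patterns. A real system would derive
# these from the syslog corpus itself; the names below are illustrative.
CATEGORIES = [
    ("memory_error",  re.compile(r"\b(ECC|mce|memory error)\b", re.IGNORECASE)),
    ("network_error", re.compile(r"\b(link down|NIC|packet loss)\b", re.IGNORECASE)),
    ("filesystem",    re.compile(r"\b(lustre|nfs|I/O error)\b", re.IGNORECASE)),
]

def categorize(entry: str) -> str:
    """Assign the first matching event category to a syslog entry."""
    for name, pattern in CATEGORIES:
        if pattern.search(entry):
            return name
    return "other"

log = [
    "kernel: mce: [Hardware Error] machine check event",
    "kernel: eth0: link down",
    "sshd: accepted publickey for user",
]
print([categorize(line) for line in log])
```

Once every entry carries a category label, the per-node sequence of categories over time forms the "generation pattern" that can be compared against a node's own history or against neighboring nodes in the same rack.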
“…Early failure detection is a new class of failure recovery methods that can be beneficial for HPC systems with short MTBF. Detecting failures in their early stage can reduce their negative effects by preventing their propagation to other parts of the system [5].…”
Section: Introduction
“…After that, in 2014, the Blue Waters supercomputer reported a mean time between node failures of 6.7 hours [2]. Most recently, the Taurus system located in TU Dresden reported a mean time between node failures of 3.65 hours [3].…”
Section: Introduction
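A quick sanity check on what the quoted MTBF figures imply. Assuming node failures arrive at a roughly steady rate (an idealization, not a claim made by the paper), the expected number of node failures in a time window is simply the window length divided by the MTBF:

```python
# Expected node failures in a window, assuming a constant failure rate
# (an idealization: expected count = window / MTBF).
def expected_failures(mtbf_hours: float, window_hours: float) -> float:
    return window_hours / mtbf_hours

# MTBF figures quoted above: Blue Waters (2014) 6.7 h, Taurus 3.65 h.
for name, mtbf in [("Blue Waters", 6.7), ("Taurus", 3.65)]:
    print(f"{name}: ~{expected_failures(mtbf, 24):.1f} node failures per day")
```

At a 3.65-hour MTBF, a system sees on the order of six to seven node failures per day, which is why the cited work argues for detecting failures early rather than only recovering after the fact.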