2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP) 2016
DOI: 10.1109/pdp.2016.101

Lessons Learned from Spatial and Temporal Correlation of Node Failures in High Performance Computers

Abstract: In this paper we study the correlation of node failures in time and space. Our study is based on measurements of a production high performance computer over an 8-month time period. We draw possible types of correlations between node failures and show that, in many cases, there are direct correlations between observed node failures. The significance of such a study is twofold: achieving a clearer understanding of correlations between node failures and enabling failure detection as early as possible. Th…

Cited by 13 publications (9 citation statements) · References 10 publications
“…Considering the high number of components in modern HPC systems and the strong correlations between node failures [8], [15], several studies investigate behavioral analysis to predict failures via anomaly detection. Both supervised [20] and unsupervised [21] approaches were proposed.…”
Section: Related Work
“…Anomaly detection is a well-known general purpose approach for detecting failures in computing systems [4]. In HPC systems, system log analysis can be used for anomaly detection for the purpose of preventing failures [5]- [8]. All HPC systems on the current TOP500 [9] list are Linux-based.…”
Section: Introduction
“…Even though this pattern changes according to the workload on the node and other environmental parameters, the pattern mostly remains similar to previous patterns of the same node. Previous studies showed that the majority of neighboring nodes (located in a similar rack or island) also exhibit similar syslog generation patterns [5]. To extract the patterns, a failure detection mechanism needs to assign an event category to each syslog entry.…”
Section: Assessing Data Usefulness
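The quoted passage describes a preprocessing step: before node-level log patterns can be compared, each syslog entry must be assigned an event category. A minimal sketch of such a categorizer is shown below; the category names and regular expressions are purely illustrative assumptions, not taken from the paper, which does not specify its category set here.

```python
import re

# Hypothetical event categories and patterns. A real system would derive
# these from the syslog corpus itself; the names below are illustrative.
CATEGORIES = [
    ("memory_error",  re.compile(r"\b(ECC|mce|memory error)\b", re.IGNORECASE)),
    ("network_error", re.compile(r"\b(link down|NIC|packet loss)\b", re.IGNORECASE)),
    ("filesystem",    re.compile(r"\b(lustre|nfs|I/O error)\b", re.IGNORECASE)),
]

def categorize(entry: str) -> str:
    """Assign the first matching event category to a syslog entry."""
    for name, pattern in CATEGORIES:
        if pattern.search(entry):
            return name
    return "other"

log = [
    "kernel: mce: [Hardware Error] machine check event",
    "kernel: eth0: link down",
    "sshd: accepted publickey for user",
]
print([categorize(line) for line in log])
```

Once every entry carries a category label, the per-node sequence of categories over time forms the "generation pattern" that can be compared against a node's own history or against neighboring nodes in the same rack.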
“…Early failure detection is a new class of failure recovery methods that can be beneficial for HPC systems with short MTBF. Detecting failures in their early stage can reduce their negative effects by preventing their propagation to other parts of the system [5].…”
Section: Introduction
“…After that, in 2014, the Blue Waters supercomputer reported a mean time between node failures of 6.7 hours [2]. Most recently, the Taurus system located in TU Dresden reported a mean time between node failures of 3.65 hours [3].…”
Section: Introduction
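A quick sanity check on what the quoted MTBF figures imply. Assuming node failures arrive at a roughly steady rate (an idealization, not a claim made by the paper), the expected number of node failures in a time window is simply the window length divided by the MTBF:

```python
# Expected node failures in a window, assuming a constant failure rate
# (an idealization: expected count = window / MTBF).
def expected_failures(mtbf_hours: float, window_hours: float) -> float:
    return window_hours / mtbf_hours

# MTBF figures quoted above: Blue Waters (2014) 6.7 h, Taurus 3.65 h.
for name, mtbf in [("Blue Waters", 6.7), ("Taurus", 3.65)]:
    print(f"{name}: ~{expected_failures(mtbf, 24):.1f} node failures per day")
```

At a 3.65-hour MTBF, a system sees on the order of six to seven node failures per day, which is why the cited work argues for detecting failures early rather than only recovering after the fact.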