End-to-end framework for fault management for open source clusters

Hammond, John L.; Minyard, Tommy; Browne, J. C.

doi:10.1145/1838574.1838583

Cited by 21 publications


References 17 publications


“…CRUMEL targets processing of Syslogs [19], Rationalized message logs [17] and TACC Stats resource use data [9]. The Rationalized message log [17] is a special type of message log that incorporates a logical structure and additional content such as job-identification to the POSIX formatted logs.…”

Section: Crumel: Data Type Extraction (mentioning)

confidence: 99%

“…The Rationalized message log [17] is a special type of message log that incorporates a logical structure and additional content such as job-identification to the POSIX formatted logs. TACC Stats [9] is a job-oriented and logically structured version of the conventional Sysstat system performance monitor.…”

Section: Crumel: Data Type Extraction (mentioning)

confidence: 99%

“…The TACC Stats [9] monitoring system and Rationalized message logging [17] resolve resource usage and system messages by jobs, nodes and time for open-source Linux-based clusters. Previous work [13], [18] has applied only Pearson Correlation, but there is little work which show that more events correlated with system failures can only be identified by applying different correlation algorithms.…”

Section: Introduction (mentioning)

confidence: 99%


Using Message Logs and Resource Use Data for Cluster Failure Diagnosis

Chuah

Jhumka

Browne

et al. 2016

2016 IEEE 23rd International Conference on High Performance Computing (HiPC)


Failure diagnosis for large compute clusters using only message logs is known to be incomplete. Recent availability of resource use data provides another potentially useful source of data for failure detection and diagnosis. Early work combining message logs and resource use data for failure diagnosis has shown promising results. This paper describes the CRUMEL framework, which implements a new approach to combining rationalized message logs and resource use data for failure diagnosis. CRUMEL identifies patterns of errors and resource use and correlates these patterns by time with system failures. Application of CRUMEL to data from the Ranger supercomputer has yielded improved diagnoses over previous research. CRUMEL has: (i) shown that more events correlated with system failures can only be identified by applying different correlation algorithms, (ii) confirmed six groups of errors, (iii) identified Lustre I/O resource use counters that are correlated with the occurrence of Lustre faults and are potential flags for online detection of failures, (iv) matched the dates of correlated error events and correlated resource use with the dates of compute node hang-ups and (v) identified two more error groups associated with compute node hang-ups. The pre-processed data will be placed in the public domain in September 2016.
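As a rough illustration of point (i) in the abstract above, the sketch below compares two correlation measures on the same pair of time-binned series. Everything in it is an assumption made for illustration: the hourly counts are invented, the function names are hypothetical, and Spearman rank correlation is chosen only as a familiar alternative to Pearson. The abstract does not say which algorithms CRUMEL applies, so this should not be read as its implementation.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of equal values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Invented hourly counts (not data from Ranger): an error-event count and a
# failure count that rises monotonically but very nonlinearly with it.
error_events = [1, 2, 3, 4, 5, 6, 7, 8]
failures = [1, 2, 3, 4, 5, 6, 20, 200]

r_pearson = pearson(error_events, failures)
r_spearman = spearman(error_events, failures)
print(f"Pearson:  {r_pearson:.3f}")
print(f"Spearman: {r_spearman:.3f}")
```

On series like these, a linear measure reports a noticeably weaker association than a rank-based one, so an event that falls below a Pearson threshold could still be flagged by a different algorithm.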


“…It integrates anomaly analysis and correlation analysis for assessing the impact of resource utilization anomalies on system failures. ANCOR processes both: (1) the resource use data which contains node-level and job-level statistics of the I/O and transfer rates and virtual memory utilization of the cluster system, and (2) the rationalized logs [16] which contain the events generated by the components of the cluster system. The coupling of resource use data by node and job with the rationalized message logs enables a two phase approach, where the resource usage data is used to identify resource anomalies and provide partial diagnosis, and the message log analysis is used to obtain a more specific and precise diagnosis.…”

Section: Introduction (mentioning)

confidence: 99%

Linking Resource Usage Anomalies with System Failures from Cluster Log Data

Chuah

Jhumka

Narasimhamurthy

et al. 2013

2013 IEEE 32nd International Symposium on Reliable Distributed Systems


Bursts of abnormally high use of resources are thought to be an indirect cause of failures in large cluster systems, but little work has systematically investigated the role of high resource usage in system failures, largely due to the lack of a comprehensive resource monitoring tool which resolves resource use by job and node. The recently developed TACC Stats resource use monitor provides the required resource use data. This paper presents the ANCOR diagnostics system, which applies TACC Stats data to identify resource use anomalies and applies log analysis to link resource use anomalies with system failures. Applying ANCOR to first identify multiple sources of resource anomalies on the Ranger supercomputer, then correlate them with failures recorded in the message logs, and finally diagnose the cause of the failures has identified four new causes of compute node soft lockups. ANCOR can be adapted to any system that uses a resource use monitor which resolves resource use by job.


“…Recognizing the challenges faced by the administrators of large cluster systems, many contributions to the study of system logs [1]-[7], fault detection [8]-[11], failure prediction [12]-[17], cluster logs preprocessing [18] and fault management [19], [20] have been made. Most of the existing work has focused on methods that improved the accuracy of fault detection and failure prediction.…”

Section: Introduction (mentioning)

confidence: 99%

Establishing Hypothesis for Recurrent System Failures from Cluster Log Files

Chuah

Lee

Tjhi

et al. 2011

2011 IEEE Ninth International Conference on Dependable, Autonomic and Secure Computing

Self Cite


A goal for the analysis of supercomputer logs is to establish causal relationships among events which reflect significant state changes in the system. Establishing these relationships is at the heart of failure diagnosis. In principle, a log analysis tool could automate many of the manual steps systems administrators must currently use to diagnose system failures. However, supercomputer logs are unstructured, incomplete and contain considerable ambiguity, so direct discovery of causal relationships is difficult. This paper describes the second-generation FDiag log-based failure diagnostics framework, which automates the manual failure diagnosis process and determines, with high confidence, the likely cause of the failure, the components involved and the event sequences which contain the times of the causal and terminal events. FDiag extracts relevant events from the system logs, performs correlation analysis on these events and from these correlations determines the components involved and the event sequences. The diagnostics capabilities of FDiag are validated by comparing its assessments on known instances of recurrent failures on the Ranger supercomputer at the University of Texas at Austin. We believe FDiag is the first log analyzer to demonstrate this level of diagnostics capability from the system logs of an open source software stack incorporating Linux and the Lustre file system. FDiag will be put into production use for support of failure diagnosis on Ranger in September 2011.

