2014 IEEE International Conference on Cluster Computing (CLUSTER) 2014
DOI: 10.1109/cluster.2014.6968757

Exploring void search for fault detection on extreme scale systems

Abstract: Mean Time Between Failures (MTBF), now calculated in days or hours, is expected to drop to minutes on exascale machines. The advancement of resilience technologies greatly depends on a deeper understanding of faults arising from hardware and software components. This understanding has the potential to help us build better fault tolerance technologies. For instance, it has been proved that combining checkpointing and failure prediction leads to longer checkpoint intervals, which in turn leads to fewer total ch…
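The abstract's claim that failure prediction permits longer checkpoint intervals can be illustrated with a first-order model. The sketch below uses Young's classic approximation for the checkpoint interval and a recall-adjusted variant in which only unpredicted failures count against the interval. The recall-adjusted form is an assumption taken from the general checkpointing literature, not from this abstract, and the function names and the numbers in the usage lines are hypothetical.

```python
import math


def young_interval(checkpoint_cost_s, mtbf_s):
    """Young's first-order approximation of the checkpoint interval (seconds)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def interval_with_prediction(checkpoint_cost_s, mtbf_s, recall):
    """Assumed first-order variant: a predictor with the given recall catches
    that fraction of failures in advance, so the MTBF of the remaining
    unpredicted failures is mtbf_s / (1 - recall)."""
    if not 0.0 <= recall < 1.0:
        raise ValueError("recall must be in [0, 1)")
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s / (1.0 - recall))


# Hypothetical numbers: 60 s checkpoint cost, 1 hour MTBF.
baseline = young_interval(60.0, 3600.0)
with_predictor = interval_with_prediction(60.0, 3600.0, recall=0.75)
print(baseline, with_predictor)
```

Under this model a predictor with recall 0.75 doubles the interval, since the interval scales with the square root of 1 / (1 - recall). Precision and the cost of proactive actions are ignored here.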

Cited by 15 publications (4 citation statements)
References 34 publications
“…The monitoring and diagnostic subsystem of Tianhe HPC systems consist of three layers of management units, the Board management Unit (BMU), including the Chassis Management Unit (CMU), and the System Management Unit (SMU), which are connected in a unified way through a dedicated monitoring and diagnostic network [34]. The subsystem has more than 200 hardware monitoring indicators, covering voltage, current, temperature, humidity, liquid cooling system, air cooling system, self-developed high-speed network card, and many other aspects. As the failure node prediction mechanism is implemented as a plugin, more advanced techniques can be easily integrated with ESLURM [31], [35], [36].…”
Section: Failure Node Prediction (mentioning)
confidence: 99%
“…Gainaru et al [29] modelled the normal and faulty behaviour of large-scale HPC systems, which would also be very helpful in the HPC system failure prediction/detection. Berrocal et al [30] proposed an effective approach for fault detection based on the Void Search (VS) algorithm, which is used primarily in astrophysics for finding areas of space that have a very low density of galaxies. The log entropy technique has also been employed for error detection within patterns in [31] and [32], since log entropy measures the changes in the frequency of log events to capture the system's behavior.…”
Section: Related Work (mentioning)
confidence: 99%
“…They trained a binary classification model to detect the potential node failure based on a given time sequence of monitoring data collected from each node. Berrocal et al [10] used environmental logs to extract numerical indicators and conducted a Void Search (VS) algorithm on these numerical values for the failure prediction task.…”
Section: Homogeneous Systems (mentioning)
confidence: 99%
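The citation statements describe Void Search as finding low-density regions and applying that idea to numerical indicators extracted from environmental logs. A minimal sketch of the general idea follows. It is a grid-based simplification, not the algorithm from Berrocal et al.: the space of indicators is cut into cells, cells never visited during normal operation are treated as voids, and a new reading that lands in a void is flagged. The indicator pair (temperature, power), the bin count, and all function names are hypothetical.

```python
from math import floor


def fit_occupied_cells(normal_points, bins=4):
    """Record which grid cells are visited by readings from normal operation.

    Returns (occupied, cell_of), where cell_of maps a point to its cell,
    or to None if the point lies outside the range seen in training.
    """
    dims = len(normal_points[0])
    lo = [min(p[d] for p in normal_points) for d in range(dims)]
    hi = [max(p[d] for p in normal_points) for d in range(dims)]
    # Guard against a zero-width dimension.
    span = [(hi[d] - lo[d]) or 1.0 for d in range(dims)]

    def cell_of(point):
        cell = []
        for d in range(dims):
            scaled = (point[d] - lo[d]) / span[d] * bins
            if scaled < 0 or scaled > bins:
                return None
            # The upper boundary belongs to the last cell.
            cell.append(min(int(floor(scaled)), bins - 1))
        return tuple(cell)

    occupied = {cell_of(p) for p in normal_points}
    return occupied, cell_of


def in_void(point, model):
    """True if the point falls in a cell that was empty during normal operation."""
    occupied, cell_of = model
    cell = cell_of(point)
    return cell is None or cell not in occupied


# Hypothetical normal behaviour: a low-load and a high-load regime,
# given as (temperature in C, power in W) pairs.
normal = [(t, p) for t in range(40, 45) for p in range(100, 105)]
normal += [(t, p) for t in range(60, 65) for p in range(200, 205)]
model = fit_occupied_cells(normal, bins=4)

print(in_void((41, 101), model))  # reading inside the low-load regime
print(in_void((42, 202), model))  # low temperature with high power
```

The second reading is within the observed range of each indicator taken separately, yet the combination was never seen, which is the kind of case a per-indicator threshold would miss. A real void finder would handle higher dimensions and choose void shapes more carefully than a fixed grid does.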