Modeling and tolerating heterogeneous failures in large parallel systems

Heien, E. M.; Kondo, Derrick; Gainaru, Ana; LaPine, Dan; Kramer, Bill; Cappello, Franck

doi:10.1145/2063384.2063444

Cited by 66 publications

(48 citation statements)

References 22 publications


“…From our previous research we observed that, for example, memory and processor cache failures usually result in a single faulty component generating hundreds or thousands of messages in less than a day [19].…”

Section: Preprocessing (mentioning)

confidence: 99%

“…The method has little effect in the sequence extraction phase since we are looking at frequent patterns, and the chance that two identical events occurring at two different locations with different root causes are frequent in a historic log is close to zero [19,20,23]. However, the improvement appears in the prediction process, when those two events could lead to two different effects and both need to be monitored.…”

Section: Preprocessing (mentioning)

confidence: 99%


Adaptive event prediction strategy with dynamic time window for large-scale HPC systems

Gainaru, Cappello, Fullop, et al. 2011

Managing Large-Scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques

Self Cite


In this paper, we analyse messages generated by different large-scale HPC systems in order to extract sequences of correlated events, which we later use to predict the normal and faulty behaviour of the system. Our method uses a dynamic window strategy that is able to find frequent sequences of events regardless of the time delay between them. Most of the current related research narrows the correlation extraction to fixed and relatively small time windows that do not reflect the whole behaviour of the system. The generated events change constantly during the lifetime of the machine. We consider it important to update the sequences at runtime by applying modifications after each prediction phase, according to the forecast's accuracy and the difference between what was expected and what really happened. Our experiments show that our analysing system is able to predict around 60% of events with a precision of around 85% at a lower event granularity than before.



“…Others have proposed rejuvenation techniques to minimize software failures [17], [18]. In addition, failure prediction has been studied [8], [19] and has shown promising results that could be leveraged for online system monitoring. While those studies are fundamental for achieving high fault tolerance at scale, they all consider a rather stable MTBF and they do not take failure bursts into consideration.…”

Section: Related Work (mentioning)

confidence: 99%

“…Event analysis and classification in large-scale systems have been the subject of many studies, and research has shown that they can lead to good prediction results [8], [9]. One of the major challenges in this endeavor is the impact of false positives.…”

Section: B Introspective Systems (mentioning)

confidence: 99%

Monitoring strategies for scalable dynamic checkpointing

Perarnau, Bautista-Gomez 2016

2016 Seventh International Green and Sustainable Computing Conference (IGSC)


Resilience is an important challenge for extreme-scale supercomputers. Failures in current supercomputers are assumed to be uniformly distributed in time. However, recent studies show that failures in high-performance computing systems are partially correlated in time, generating periods of higher failure density. The detection of those periods is important in order to adjust the system to new conditions. In this paper we present a monitoring system that listens to hardware events across computing nodes and forwards important events to the fault tolerance runtime so it can react to those regime changes. Our evaluation at scale shows several aspects of this dynamic checkpointing scheme, critical to understanding its applicability on production systems, as well as to identifying possible avenues for future improvements. In particular, we evaluate the ability of our system to monitor as many types of events as possible, measure their importance, and forward them to the resilience runtime.


“…The key parameter is the MTBF µ = 1/λ. Weibull distributions are a good example of probability distributions that account for infant mortality, and they are widely used to model failures on computer platforms [42,67,54,39,43]. The definition of Weibull(λ), the Weibull distribution law of shape parameter k and scale parameter λ, goes as follows:…”

Section: Resilience At Scale (mentioning)

confidence: 99%
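
To make the contrast in the statement above concrete, here is a minimal sketch of how MTBF and failure rate behave under a Weibull law. It assumes the standard textbook parameterization, with cumulative distribution 1 - exp(-(t/λ)^k), which is not necessarily the exact form used in the cited text. The function names `weibull_mtbf` and `weibull_hazard` are hypothetical and chosen only for illustration.

```python
import math


def weibull_mtbf(shape_k: float, scale_lam: float) -> float:
    """Mean time between failures of a Weibull law.

    Assumes the standard parameterization (shape k, scale lambda),
    whose mean is lambda * Gamma(1 + 1/k).
    """
    return scale_lam * math.gamma(1.0 + 1.0 / shape_k)


def weibull_hazard(t: float, shape_k: float, scale_lam: float) -> float:
    """Instantaneous failure rate at time t > 0 under the same assumption."""
    return (shape_k / scale_lam) * (t / scale_lam) ** (shape_k - 1.0)


# With k = 1 the Weibull law reduces to an exponential law:
# the hazard is constant and the MTBF equals the scale parameter.
exponential_like_mtbf = weibull_mtbf(1.0, 5.0)

# With k < 1 the hazard decreases over time, which is the
# "infant mortality" behaviour mentioned in the statement.
early_rate = weibull_hazard(1.0, 0.7, 5.0)
late_rate = weibull_hazard(10.0, 0.7, 5.0)
```

Under these assumptions, a shape parameter below 1 gives a failure rate that is highest early in a component's life and falls afterwards, while a shape parameter of exactly 1 recovers the memoryless exponential case.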

Fault-Tolerance Techniques for High-Performance Computing

2015

Computer Communications and Networks


This report provides an introduction to resilience methods. The emphasis is on checkpointing, the de-facto standard technique for resilience in High Performance Computing. We present the main two protocols, namely coordinated checkpointing and hierarchical checkpointing. Then we introduce performance models and use them to assess the performance of these protocols. We cover the Young/Daly formula for the optimal period and much more! Next we explain how the efficiency of checkpointing can be improved via fault prediction or replication. Then we move to application-specific methods, such as ABFT. We conclude the report by discussing techniques to cope with silent errors (or silent data corruption). This report is a slightly modified version of the first chapter of the monograph Fault tolerance techniques for high-performance computing edited by Thomas Herault and Yves Robert, and to be published by Springer Verlag.
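
The Young/Daly formula named in the abstract above is commonly stated, to first order, as an optimal checkpointing period of sqrt(2 · µ · C), where µ is the platform MTBF and C is the time needed to write one checkpoint. The sketch below assumes that first-order form. The function name `young_daly_period` and the example values are hypothetical and are not taken from the report.

```python
import math


def young_daly_period(mtbf: float, checkpoint_cost: float) -> float:
    """First-order Young/Daly checkpointing period, sqrt(2 * mtbf * cost).

    Both arguments must use the same time unit (for example, seconds).
    """
    if mtbf <= 0 or checkpoint_cost <= 0:
        raise ValueError("mtbf and checkpoint_cost must be positive")
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


# Illustrative values only: a 2-hour platform MTBF and a 100-second checkpoint.
period_seconds = young_daly_period(mtbf=7200.0, checkpoint_cost=100.0)
```

The approximation is usually considered reasonable only when the checkpoint cost is small relative to the MTBF, so a shorter MTBF (for example during a burst of failures) calls for a shorter period.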

