Online failure prediction in cloud datacenters by real-time message pattern learning

Watanabe, Yukihiro; Otsuka, Hiroshi; Sonoda, Masataka; Kikuchi, Shinji; Matsumoto, Yasuhide

doi:10.1109/cloudcom.2012.6427566

Cited by 43 publications

(24 citation statements)

References 8 publications


“…They filter and clean data to improve a prediction model's performance. There has also been a whole array of techniques applied to predicting failures in distributed system and computing clusters, where the works of Watanabe et al [2] [3], who show a method of pattern learning for the prediction of failures, Salfner et al [6][11], who model a system using Semihidden Markov Models and add fuzzy logic to the OFP scenario, and, specially, Zheng et al [5] [7] are the main ones. The latter authors have worked for several years with the IBM Blue Gene supercomputer and have a long streak of papers related to the issue at hand.…”

Section: Related Work (mentioning)

confidence: 99%

Classification in Sparse, High Dimensional Environments Applied to Distributed Systems Failure Prediction

Navarro

Parada

Dueñas

2015

Artificial Intelligence and Soft Computing


Abstract. Network failures are still one of the main causes of distributed systems' lack of reliability. To overcome this problem we present an improvement over a failure prediction system, based on Elastic Net Logistic Regression and the application of rare events prediction techniques, able to work with sparse, high dimensional datasets. Specifically, we prove its stability, fine-tune its hyperparameter and improve its industrial utility by showing that, with a slight change in dataset creation, it can also predict the location of a failure, a key asset when trying to take a proactive approach to failure management.


“…In 2006, Liang et al [6] propose a series of three ad-hoc created predictors for the Blue Gene supercomputer, based on the analysis of the failure characteristics found on its logs. Watanabe et al propose a method quite similar to association rules in [7]: the creation of an event dictionary with the associated probabilities for each entry to precede a failure. On the other hand, in [8], Sonoda, Watanabe and Matsumoto show how to identify patterns that lead to system failures through Bayesian learning, achieving a precision of over 0.8 and recall over 0.7.…”

Section: Online Failure Prediction (mentioning)

confidence: 99%

System Failure Prediction through Rare-Events Elastic-Net Logistic Regression

Navarro

Parada

Dueñas

2014

2014 2nd International Conference on Artificial Intelligence, Modelling and Simulation


Abstract. Predicting failures in a distributed system based on previous events through logistic regression is a standard approach in the literature. This technique is not reliable, though, in two situations: in the prediction of rare events, which do not appear in enough proportion for the algorithm to capture, and in environments where there are too many variables, as logistic regression tends to overfit in these situations, while manually selecting a subset of variables to create the model is error-prone. In this paper, we solve an industrial research case that presented this situation with a combination of elastic net logistic regression, a method that allows us to automatically select useful variables, a process of cross-validation on top of it and the application of a rare events prediction technique to reduce computation time. This process provides two layers of cross-validation that automatically obtain the optimal model complexity and the optimal model parameter values, while ensuring even rare events will be correctly predicted with a low amount of training instances. We tested this method against real industrial data, obtaining a total of 60 out of 80 possible models with a 90% average model accuracy.
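The citation statement for this record describes the cited method as "the creation of an event dictionary with the associated probabilities for each entry to precede a failure". A minimal sketch of that idea follows. It is only an illustration of the one-sentence description, not the algorithm of Watanabe et al.; the function name `build_event_dictionary`, the fixed look-ahead `window`, and the toy log values are all assumptions made for this example.

```python
from collections import Counter


def build_event_dictionary(events, failure_times, window):
    """Map each message type to the fraction of its occurrences that were
    followed by a failure within `window` time units (hypothetical helper)."""
    total = Counter()
    before_failure = Counter()
    for timestamp, message_type in events:
        total[message_type] += 1
        # Count the occurrence if any failure falls inside the look-ahead window.
        if any(0 < f - timestamp <= window for f in failure_times):
            before_failure[message_type] += 1
    return {m: before_failure[m] / total[m] for m in total}


# Toy log of (timestamp, message type); the values are invented for illustration.
events = [(1, "disk_warn"), (2, "login"), (8, "disk_warn"), (9, "login"), (20, "login")]
failure_times = [4, 10]
dictionary = build_event_dictionary(events, failure_times, window=3)
```

In this toy log every `disk_warn` message is followed by a failure within the window, while only two of the three `login` messages are, so `disk_warn` receives the higher estimate. A real system would presumably work on message patterns rather than single messages and would update the estimates online.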


“…There are many different design principles such as 'Eliminating Single Point of Failure' [27,38], 'Disaster Recovery' [39,40] and 'Real-Time and Fast Failure Detection' [41][42][43] that can help achieve high availability and reliability in cloud computing environments. The single point of failure (SPOF) in cloud computing datacenters can occur in both software and hardware level.…”

Section: Research Background (mentioning)

confidence: 99%

“…The time taken to detect a failure is one of the key factors in the cloud computing environments. So, fast and real-time failure detection to identify or predict a failure in the early stages is one of the most important principles to achieving high availability and reliability in cloud systems [42,43]. Moreover, there are some new trends in cloud computing such as SDN-based technology like Espresso that makes cloud infrastructures more reliable and available in the network level [44].…”

Section: Research Background (mentioning)

confidence: 99%

Reliability and high availability in cloud computing environments: a reference roadmap

Mesbahi

Rahmani

Hosseinzadeh

2018

Hum. Cent. Comput. Inf. Sci.


Reliability and high availability have always been a major concern in distributed systems. Providing highly available and reliable services in cloud computing is essential for maintaining customer confidence and satisfaction and preventing revenue losses. Although various solutions have been proposed for cloud availability and reliability, there are no comprehensive studies that completely cover all the different aspects of the problem. This paper presented a ‘Reference Roadmap’ of reliability and high availability in cloud computing environments. A big picture was proposed which was divided into four steps specified through four pivotal questions starting with the ‘Where?’, ‘Which?’, ‘When?’ and ‘How?’ keywords. The desirable result of having a highly available and reliable cloud system could be gained by answering these questions. Each step of this reference roadmap proposed a specific concern of a special portion of the issue. Two main research gaps were proposed by this reference roadmap.
