The Kubernetes (K8s) containerization platform has recently enabled cloud-based, lightweight, highly scalable, and agile services in both general and telco use cases. Service providers must keep containerized services highly available, reliable, and continuous in order to offer end-users a fault-tolerant, transparent service experience. Meeting this requirement calls for fault prediction and proactive stateful service recovery in cloud systems. Prior proactive failure recovery approaches have mostly focused either on improving fault prediction performance with various machine learning time-series forecasting techniques or on optimizing the placement of recovery services after a fault is predicted. However, a mechanism that migrates a stateful containerized service from the predicted faulty node to a healthy destination node has not been studied. In previous proactive works, service migration is either only simulated or performed with virtual machine (VM) migration techniques. In this paper, we propose a proactive stateful fault-tolerant system for K8s containerized services that pipelines a Bidirectional Long Short-Term Memory (Bi-LSTM) fault prediction framework with a novel K8s stateful service migration mechanism for service recovery. Experimental results show that the Bi-LSTM model improves prediction performance over the other time-series forecasting models used in prior proactive works. We then combine the Bi-LSTM fault prediction framework with both the default K8s migration mechanism and our stateful migration mechanism. A comparison of the two resulting proactive systems demonstrates the efficiency of our system in avoiding Quality of Service (QoS) violations.
INDEX TERMS Containerization, proactive fault tolerance, Kubernetes