A Markov reliability model is considered for a failover cluster that performs computations in a cyber-physical system. Continuity of the cluster's computing process when the physical resources of a server fail is provided by virtualization technology and relies on the migration of virtual machines. The proposed model differs in that it takes into account the limit on the allowable interruption time of the computing process during cluster recovery. This limit arises because the failure of both physical servers causes loss of control over the managed object, which is unacceptable. A system failure occurs if the recovery time exceeds the maximum allowable interruption time of the computing process. Cluster operation is considered both with and without recovery after failures of part of the system resources that do not break the continuity of the computing process. The results make it possible to estimate the probability that the cluster remains operable while maintaining continuity of computations, and its operating time to failure, where failure means an interruption of the computational (control) process beyond the maximum permissible time. A calculation example for the presented models shows that recovery under conditions that maintain continuity of the computing process increases the mean time to failure by more than two orders of magnitude.
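The comparison of mean time to failure with and without recovery can be illustrated with a minimal sketch. It is not the model from the paper: it assumes a two-server cluster with exponential failure rate `lam` per server and exponential repair rate `mu`, and it represents the limit on interruption time by a hypothetical probability `q` that a failure of the last working server leads to an interruption longer than the allowed time (here taken as `exp(-mu * t_max)`, which is also an assumption). All numeric values are illustrative.

```python
import math

def mean_time_to_failure(lam, mu, q=1.0):
    """Mean time to absorption, starting with both servers up.

    Transient states: 0 = both servers up, 1 = one server up.
    Transitions (all assumed): 0 -> 1 at rate 2*lam, 1 -> 0 at rate mu,
    1 -> failure at rate lam*q. Solves (-Q_T) t = (1, 1) by Cramer's rule.
    """
    a11, a12 = 2.0 * lam, -2.0 * lam
    a21, a22 = -mu, mu + lam * q
    det = a11 * a22 - a12 * a21
    return (a22 - a12) / det  # first component of the solution

# Illustrative parameters (hypothetical, per hour).
LAM = 1e-4    # failure rate of one server
MU = 1.0      # repair rate
T_MAX = 0.5   # allowed interruption time, hours

# Assumed probability that recovery takes longer than T_MAX.
q = math.exp(-MU * T_MAX)

mttf_no_recovery = mean_time_to_failure(LAM, 0.0)      # no repair at all
mttf_recovery = mean_time_to_failure(LAM, MU)          # repair, no tolerated interruption
mttf_with_limit = mean_time_to_failure(LAM, MU, q)     # repair, interruptions up to T_MAX tolerated

print(mttf_no_recovery, mttf_recovery, mttf_with_limit)
```

With `mu = 0` the function reduces to the classical value 3/(2·lam) for a duplicated system without repair, and with `q = 1` to (3·lam + mu)/(2·lam²), which gives a quick sanity check on the sketch.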
High reliability, fault tolerance, and continuity of the computing process in computer systems are supported by clustering computing resources. Clustering relies on virtualization technology: virtual resources, services, or applications are moved between physical servers while the continuity of the computing process is maintained. The object of study is a fault-tolerant cluster, which in the simplest case consists of two physical servers (primary and backup) connected through a switch. Each server has a local hard disk. The local disks form a distributed storage system with synchronous data replication from the source server to the backup server. A virtual machine runs on the cluster. A shadow copy of the virtual machine runs on the backup server, so that after the primary server fails the computational process continues without interruption on the backup server's virtual machine. Stationary and nonstationary availability coefficients are used as reliability indicators. The paper proposes a Markov reliability model of a fault-tolerant cluster that takes into account virtual machine migration costs, as well as the mechanisms that ensure continuity of the computing process (service) in the cluster when one physical server fails. After migration, two copies of the virtual machine located on different physical servers are kept in memory, so that if one of them fails, work continues on the other. A simplified model of a fault-tolerant cluster that ignores the costs of virtual machine migration during cluster recovery is also developed. It gives an upper bound on reliability.
The paper shows the notable impact of the virtual machine migration process on failover cluster reliability (measured by the nonstationary availability coefficient). The obtained results can be used to justify the choice of means for ensuring fault tolerance and continuity of the computing process in computer systems with a cluster architecture.
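The effect of migration on the nonstationary availability coefficient can be sketched as follows. This is an illustrative model, not the one from the paper: the state sets, the transition structure, and the rates `LAM` (server failure), `MU` (server repair), and `NU` (migration completion) are all assumptions. The simplified model omits the migration state and therefore serves as an upper estimate.

```python
def transient_probabilities(Q, p0, t_end, dt=0.01):
    """Integrate the Kolmogorov equations dp/dt = p Q with explicit Euler steps."""
    p = list(p0)
    n = len(p)
    for _ in range(int(round(t_end / dt))):
        dp = [sum(p[i] * Q[i][j] for i in range(n)) for j in range(n)]
        p = [p[j] + dt * dp[j] for j in range(n)]
    return p

def generator_with_migration(lam, mu, nu):
    # Assumed states: 0 = both servers up, copies in sync; 1 = one server up;
    # 2 = both servers up, shadow copy being re-created by migration; 3 = down.
    return [
        [-2 * lam, 2 * lam, 0.0, 0.0],
        [0.0, -(mu + lam), mu, lam],
        [nu, lam, -(nu + 2 * lam), lam],
        [0.0, mu, 0.0, -mu],
    ]

def generator_simplified(lam, mu):
    # Assumed states: 0 = both servers up; 1 = one server up; 2 = down.
    # Migration is treated as instantaneous.
    return [
        [-2 * lam, 2 * lam, 0.0],
        [mu, -(mu + lam), lam],
        [0.0, mu, -mu],
    ]

def availability(p):
    """Probability of not being in the last (down) state."""
    return 1.0 - p[-1]

# Illustrative rates per hour (hypothetical values).
LAM, MU, NU = 1e-3, 0.1, 1.0
T = 200.0  # hours

p_mig = transient_probabilities(generator_with_migration(LAM, MU, NU), [1.0, 0.0, 0.0, 0.0], T)
p_simple = transient_probabilities(generator_simplified(LAM, MU), [1.0, 0.0, 0.0], T)

print(availability(p_mig), availability(p_simple))
```

Explicit Euler integration is chosen only to keep the sketch dependency-free; the step `dt` must stay well below the reciprocal of the largest transition rate. For large `T` the result approaches the stationary availability coefficient of the assumed chain.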