Abstract-Clusters of computers can provide, in aggregate, reliable services despite the failure of individual computers. System-level virtualization is widely used to consolidate the workload of multiple physical systems as multiple virtual machines (VMs) on a single physical computer. A single physical computer thus forms a virtual cluster of VMs. A key difficulty with virtualization is that the failure of the virtualization infrastructure (VI) often leads to the failure of multiple VMs. This is likely to overload "cluster computing" resiliency mechanisms, typically designed to tolerate the failure of only a single node at a time. By supporting recovery from failure of key VI components, we have enhanced the resiliency of a VI (Xen), thus enabling the use of existing "cluster computing" techniques to provide resilient virtual clusters. In the overwhelming majority of cases, these enhancements allow recovery from errors in the VI to be accomplished without the failure of more than a single VM. The resulting resiliency of the virtual cluster is demonstrated by running two existing "cluster computing" systems while subjecting the VI to injected faults.
Asynchronous events and complex system state distributed across independent nodes make exposure and diagnosis of flaws in distributed systems a challenge. The difficulties are exacerbated when the goal is to validate fault tolerance mechanisms that are activated only by the occurrence of errors, which are, by nature, rare. Validation of fault tolerance mechanisms is often done by injecting faults that emulate the actual faults and "stress" the functionality of the resilience mechanisms. Validation campaigns lasting days and involving thousands of fault injections are often necessary. We present an infrastructure that combines virtualization and software-implemented fault injection to automate validation campaigns and support the analysis of the behavior of a distributed system under test. Virtualization enables: 1) a flexible fault injector capable of emulating a wide variety of faults, and 2) a mechanism for autonomously recovering faulty nodes so that the campaign can continue running on a target system that is fully functional. As a case study, we use this infrastructure to validate a Byzantine-fault-tolerant cluster manager. Over 1280 hours of fault injections yielded the exposure of 11 unique flaws in the cluster manager.
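The campaign structure described above (inject a fault, observe the system, then recover faulty nodes so that the next run starts from a fully functional target) can be illustrated with a minimal sketch. This is not the authors' infrastructure: the `Node` class, the `run_campaign` function, and the modeling of a fault as a simple health flag are all hypothetical simplifications chosen for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a VM hosting one node of the system under test."""
    name: str
    healthy: bool = True

    def inject_fault(self) -> None:
        # Assumption: a fault is modeled as the node simply becoming unhealthy.
        self.healthy = False

    def restore(self) -> None:
        # Assumption: recovery returns the node to a known-good state.
        self.healthy = True


def run_campaign(nodes, num_injections, seed=0):
    """Run a sequence of fault injections, recovering faulty nodes after each."""
    rng = random.Random(seed)  # fixed seed so a campaign can be replayed
    log = []
    for run in range(num_injections):
        target = rng.choice(nodes)
        target.inject_fault()
        # Record the observed state of the system for later analysis.
        failed = [n.name for n in nodes if not n.healthy]
        log.append({"run": run, "target": target.name, "failed": failed})
        # Recover every faulty node so the next run starts fully functional.
        for n in nodes:
            if not n.healthy:
                n.restore()
    return log


if __name__ == "__main__":
    cluster = [Node(f"vm{i}") for i in range(4)]
    for entry in run_campaign(cluster, num_injections=5):
        print(entry)
```

In this sketch, the recovery step after each run is what keeps one injected fault from contaminating the results of the next injection.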
We describe the communication infrastructure (CI)