Fault tolerance is a necessity if high-performance clusters and grid computing are to be adopted for mission-critical applications.
The collective resource utilization achieved through grid computing is critical to the overall computing capacity of the collaborative community and should be guaranteed. In particular, in an existing environment where job sites are Beowulf cluster systems, a service node failure may cause an outage of the whole system. Current grid fault tolerance techniques address these issues only in an opportunistic fashion. Thus, there is a need to complement these approaches by proactively handling failures at the job-site level, ensuring high system availability with no loss of user-submitted jobs. Our grid-aware cluster resource management effort was motivated by the fact that the cluster has become a popular job site in the computational grid environment. We propose a solution that deals with fault tolerance at the service level, complementing the task-based solutions adopted in some recent studies. We discuss various service availability issues related to the grid, along with preliminary results obtained while implementing the smart failover and transparent job-queue replication mechanism and the automated grid installation package. Our report shows that the benefits of our proof-of-concept framework outweigh its overhead, which remains acceptable.
Physicists today employ grid technology to overcome various resource-level hurdles. The collective resource utilization achieved through grid computing is critical to the overall computing capacity of the community and should be guaranteed. In an environment where job sites are cluster systems, a service node failure causes an outage of the whole system. Our grid-aware HA-OSCAR effort was motivated by the popularity of the cluster architecture in the grid environment. We propose a high-availability architecture, HA-OSCAR, for cluster-based job sites in the grid environment. This architecture deals with fault tolerance at the service level, complementing task-based solutions such as checkpoint/restart. We discuss various service availability issues related to the grid, along with preliminary results obtained while implementing the smart failover feature and the automated grid installation package. Our report presents the performance benefits achieved after applying the HA-OSCAR solution to the cluster components of the grid, compared with regular Beowulf-style cluster solutions.