The frequent and volatile unavailability of volunteer-based Grid computing resources challenges Grid schedulers to make effective job placements. The manner in which host resources become unavailable will have different effects on different jobs, depending on their runtime and their ability to be checkpointed or replicated. A multi-state availability model can help improve scheduling performance by capturing the various ways a resource may be available or unavailable to the Grid. This paper uses a multi-state model and analyzes a machine availability trace in terms of that model. Several prediction techniques then forecast resource transitions into the model's states. We analyze the accuracy of our predictors, which outperform existing approaches. We also propose and study several classes of schedulers that utilize the predictions, and a method for combining scheduling factors. We characterize the inherent tradeoff between job makespan and the number of evictions due to failure, and demonstrate how our schedulers can navigate this tradeoff under various scenarios. Lastly, we propose job replication techniques, which our schedulers utilize to replicate those jobs that are most likely to fail. Our replication strategies outperform others, as measured by improved makespan and fewer redundant operations. In particular, we define a new metric for replication efficiency, and demonstrate that our multi-state availability predictor can provide information that allows our schedulers to be more efficient than others that blindly replicate all jobs or some static percentage of jobs.
The functional heterogeneity of non-dedicated computational grids will increase with the inclusion of resources from desktop grids, P2P systems, and even mobile grids. Machine failure characteristics, as well as individual and organizational policies for resource usage by the grid, will vary even more than they already do. Since grid applications also vary as to how well they tolerate the failure of the host on which they run, grid schedulers must begin to predict and consider how resources will transition between availability modes. Toward this goal, this paper introduces five availability states, and characterizes a Condor pool trace that uncovers when, how, and why its resources reside in, and transition between, these states. This characterization suggests resource categories that schedulers can use to make better mapping decisions. Simulations that characterize how a variety of jobs would run on the traced resources demonstrate this approach's potential for performance improvement. A simple predictor based on the previous day's behavior indicates that the states and categories are somewhat predictable, thereby supporting the potential usefulness of multi-state grid resource availability characterization.
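As a rough illustration of what a predictor based on the previous day's behavior might look like, the sketch below copies each hour's observed state from the most recent day and scores the forecast against what actually happened. It is not the paper's implementation: the state labels (`S0` to `S4`), the hourly sampling, and the function names `predict_from_previous_day` and `accuracy` are all assumptions made for the example, since the abstract does not name the five states or describe the predictor in detail.

```python
# Hypothetical labels: the abstract says there are five availability states
# but does not name them, so placeholders are used here.
STATES = ("S0", "S1", "S2", "S3", "S4")


def predict_from_previous_day(history, samples_per_day=24):
    """Forecast the next day's states by repeating the most recent day.

    `history` is a sequence of state labels, one per sample (assumed hourly),
    ordered oldest to newest.
    """
    if len(history) < samples_per_day:
        raise ValueError("need at least one full day of history")
    return list(history[-samples_per_day:])


def accuracy(predicted, actual):
    """Fraction of samples where the predicted state matched the observed one."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        return 0.0
    hits = sum(p == a for p, a in zip(predicted, actual))
    return hits / len(predicted)


if __name__ == "__main__":
    yesterday = ["S0"] * 24                 # machine stayed in one state all day
    today = ["S0"] * 18 + ["S1"] * 6        # it changed state for the last 6 hours
    forecast = predict_from_previous_day(yesterday)
    print(accuracy(forecast, today))
```

A scheduler could use such a forecast to avoid placing long, non-checkpointable jobs on machines expected to leave an available state soon; how the paper's schedulers actually weigh this is not stated in the abstract.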