We describe in this paper a new method for building an efficient algorithm for scheduling jobs in a cluster. Jobs are considered as parallel tasks (PT) which can be scheduled on any number of processors. The main feature is to consider two criteria that are optimized together. These criteria are the makespan and the weighted minimal average completion time (minsum). They are chosen for their complementarity, to be able to represent both user-oriented objectives and system administrator objectives.We propose an algorithm based on a batch policy with increasing batch sizes, with a smart selection of jobs in each batch. This algorithm is assessed by intensive simulation results, compared to a new lower bound (obtained by a relaxation of ILP) of the optimal schedules for both criteria separately. It is currently implemented in an actual real-size cluster platform.
Today, large scale parallel systems are available at low cost. Many powerful such systems have been installed all over the world and the number of users is always increasing. The difficulty of using them efficiently is growing with the complexity of the interactions between more and more architectural constraints and the diversity of the applications. The design of efficient parallel algorithms has to be reconsidered under the influence of new parameters of such platforms (namely, cluster, grid and global computing) which are characterized by a larger number of heterogeneous processors, often organized in several hierarchical sub-systems. At each step of the evolution of the parallel processing field, researchers designed adequate computational models whose objective was to abstract the real world in order to be able to analyze the behavior of algorithms.In this paper, we will investigate two complementary computational models that have been proposed recently: Parallel Task (PT) and Divisible Load (DL). The Parallel Task (i.e. tasks that require more than one processor for their execution) model is a promising alternative for scheduling parallel applications, especially in the case of slow communication media. The basic idea is to consider the application at a coarse level of granularity. Another way of looking at the problem (which is somehow a dual view) is the Divisible Load model where an application is considered as a collection of a large number of elementary -sequential -computing units that will be distributed among the available resources. Unlike the PT model, the DL model corresponds to a fine level of granularity. We will focus on the PT model, and discuss how to mix it with simple Divisible Load scheduling.As the main difficulty for distributing the load among the processors (usually known as the scheduling problem) in actual systems comes from handling efficiently the communications, these two models of the problem allow us to consider them implicitly or to mask them, thus leading to more tractable problems.We will show that in spite of the enormous complexity of the general scheduling problem on new platforms, it is still useful to study theoretical 1 Contact Author: Denis Trystram (Denis.Trystram@imag.fr) 1 models. We will focus on the links between models and actual implementations on a regional grid with more than 500 processors.
Today, large scale parallel systems are available at relatively low cost. Many powerful such systems have been installed all over the world and the number of users is always increasing. The use of such clusters require special administration tools, which have been developed recently, and most frequently rely on intuitive heuristics to solve the underlying difficult scheduling problems. On the other hand, recent theoretical work has designed a number of models, such as the Parallel Task model, specifically aimed at cluster environments.The objective of this work is to study and analyze the path from theoretical models and algorithms to their implementation on an actual environment on a real platform. We outline in details the divergences between classical models and a real batch scheduling system, and propose solutions to adapt a theoretically well-founded algorithm to such a system. Experimental results show that this implementation perform about as well as FCFS with backfilling, and much better in the difficult instances. We hope that further usage of this algorithm in a live system will show that interaction between theory and practice can be fruitful.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.