Experiments presented in this paper were carried out using the Grid'5000 experimental testbed, being developed under the INRIA ALADDIN development action with support from CNRS, RENATER and several Universities as well as other funding bodies (see https://www.grid5000.fr).

The experiments using our prototype system were carried out on machines owned by the Department of Computer Science and Engineering at the University of California, San Diego with the permission of Geoffrey Voelker. System administration and much assistance were provided by Gjergji Zyba.
ABSTRACT

This research focuses on the problem of job scheduling on homogeneous computational clusters. Clusters are widely used today for a variety of purposes, including high-performance scientific computing and Internet service hosting. While clusters may have impressive aggregate performance metrics, they are really only collections of fairly modest machines, which makes scheduling jobs for the best performance a non-trivial problem. Most clusters also need to be shared among users to amortize their start-up and maintenance costs, and ensuring that these users are treated fairly further adds to the difficulty. Existing approaches to scheduling attempt to address both of these issues, but have several limitations.

We propose a novel approach, called Dynamic Fractional Resource Scheduling (DFRS), to sharing homogeneous cluster computing platforms among competing jobs. The key features of DFRS are that it leverages existing virtual machine technology in order to share resources more efficiently and it defines and optimizes a user-centric metric that captures notions of both performance and fairness. In this dissertation we explain the principles behind DFRS and its advantages over the current state of the art, develop a theoretical model of resource sharing, design heuristics to optimize the proposed metric within the given framework, implement and run simulations comparing DFRS to traditional approaches using popular and accepted performance metrics, and finally develop and test a prototype implementation based on existing technologies. Our results show that it is possible to develop heuristic algorithms that give results reasonably close to theoretical bounds for a variety of cases, that resource requirements are well within the capabilities of modern systems, and that for some scenarios DFRS can provide orders-of-magnitude levels of improvement in performance over current approaches.