Network performance plays an important role in determining overall computational performance in high-performance computing environments such as supercomputers and Grid systems. However, most Job Management Systems (JMSs) available today, which manage multiple computing resources to distribute and balance a computational workload, do not take the network into account in resource management and allocation. In this paper, we briefly overview our proposed and prototyped network-aware JMS, which can allocate an appropriate set of computing and network resources to a job request, and we evaluate the usefulness and effectiveness of the proposal. Experiments conducted with the prototype implementation indicate that the proposed network-aware JMS could reduce job execution time by 23.4 percent.
I. INTRODUCTION

In scientific computation that uses parallel and distributed computing techniques, network performance affects the total execution time of a computation. Recently, the dominant trend in computer architecture for high-performance computing has been the cluster system, which is composed of multiple computers connected by a high-speed network. More than 80 percent of high-performance computers are cluster systems [1]. To achieve high performance on a cluster system, communication time must be reduced. In particular, in large-scale distributed computing environments such as campus Grid systems, communication overhead becomes the prime limiting factor.

Most cluster systems available today deploy a JMS such as NQS [2], PBS [3], or the Open Grid Scheduler/Grid Engine (OGS/GE) [4]. JMSs are generally used to distribute and balance computational workload. A user can submit a job to a JMS without knowing which computing hosts of the cluster system are available. However, such traditional JMSs are designed to allocate only computing resources such as CPU and memory to each submitted job, without taking network performance into account. A major reason for this is the assumption that the network resources of a cluster system always have enough capacity to accommodate the simultaneous execution of multiple jobs.