An Architectural Design of a Job Management System Leveraging Software Defined Network

Kido

2013 19th IEEE International Conference on Networks (ICON)

et al. 2013

Self Cite

Network performance in high-performance computing environments such as supercomputers and Grid systems plays an important role in determining the overall performance of computation. However, most Job Management Systems (JMSs) available today, which are responsible for managing multiple computing resources for distribution and balancing of a computational workload, do not consider network awareness for resource management and allocation. In this paper, we briefly overview our proposed and prototyped network-aware JMS, which can allocate an appropriate set of computing and network resources to a job request. We also evaluate the usefulness and effectiveness of our proposal. Experiments conducted with the prototype implementation indicate that our proposed network-aware JMS could reduce job execution time by 23.4 percent.

I. INTRODUCTION

In the area of scientific computation using parallel and distributed computing techniques, network performance affects the total execution time of computation. Recently, the dominant trend in computer architecture for high-performance computing has been cluster systems, which are composed of multiple computers on a high-speed network. More than 80 percent of high-performance computers are cluster systems [1]. To gain high performance on a cluster system, the communication time must be reduced. In particular, in large-scale and distributed computing environments such as a campus Grid system, communication overhead becomes the prime limiting factor.

Most cluster systems available today have deployed JMSs such as NQS [2], PBS [3] and the Open Grid Scheduler/Grid Engine (OGS/GE) [4]. These JMSs are generally used for computational workload distribution and balancing. The user can submit a job to a JMS without being aware of which computing hosts of a cluster system are available. However, such traditional JMSs are designed to allocate only computing resources such as CPU and memory to each job submitted to them, without taking network performance into account. A major reason for this is the assumption that the network resources of a cluster system always have enough capacity to accommodate the simultaneous execution of multiple jobs.

Section: B OpenFlow (mentioning)

confidence: 99%

Section: A Missing Functionalities in Traditional JMSs (mentioning)

confidence: 99%

Prototyping and evaluation of a network-aware Job Management System on a cluster system

Kido

2013 19th IEEE International Conference on Networks (ICON)

et al. 2013

Self Cite

“…In this section, we briefly introduce the SDN concept and OpenFlow technology, and explain resource management for an HPC cluster system with a fat-tree interconnect on our proposed SDN-enhanced JMS framework [8], [9].…”

Section: SDN-enhanced JMS (mentioning)

confidence: 99%

“…Currently, we have been developing a network-aware JMS integrated Software-Defined Networking (SDN) concept, which can dynamically control an entire network in a centralized manner, into a traditional JMS [8], [9]. The framework of our proposed SDN-enhanced JMS has mechanisms for monitoring the use of network resources and allocating communication paths to jobs, and allows an administrator to define how to allocate both computational and network resources to jobs in accordance with system architecture and operating policy.…”

Section: Introduction (mentioning)

confidence: 99%

Efficacy Analysis of a SDN-enhanced Resource Management System through NAS Parallel Benchmarks

Abe

et al. 2014

Rev Socionetwork Strat

Self Cite

In the field of social science, a variety of high-performance computing simulations such as Monte Carlo simulation and multi-agent simulation must be performed efficiently to deal with social scientific big data. To help social scientists perform their own analyses of such big data, the information infrastructure for social science must be equipped with a core technology that efficiently and effectively leverages the limited resources available on that infrastructure. From this perspective, this paper investigates a new type of job management technology which, unlike traditional job management, treats not only computational resources such as the Central Processing Unit (CPU) and memory but also network resources. A cluster system with a fat-tree topology interconnect is a conventional cluster architecture these days. For this investigation, the NASA Advanced Supercomputing (NAS) Parallel Benchmarks, which contain computation patterns often observed in social scientific simulations, are used to assess the efficacy of resource allocation by our proposed job management technology on a cluster system with a fat-tree topology interconnect.

Performance Characteristics of an SDN-Enhanced Job Management System for Cluster Systems with Fat-Tree Interconnect

2014 IEEE 6th International Conference on Cloud Computing Technology and Science

Abe

et al. 2014

Self Cite

In the era of cloud computing, data centers that accommodate a series of user-requested jobs with diverse resource usage patterns need to be capable of efficiently distributing resources to each user job, based on its individual resource usage pattern. In particular, for high-performance computing as a cloud service, which allows many users to benefit from a large-scale computing system, a new framework for resource management that treats not only the CPU resources but also the network resources in the data center is essential. In this paper, an SDN-enhanced JMS that efficiently handles both network and CPU resources, and as a result shortens the execution time of user jobs, is introduced as a building-block technology for such an HPC cloud. Our evaluation shows that the SDN-enhanced JMS efficiently leverages the fat-tree interconnect of cluster systems running behind the cloud to suppress the collision of communications generated by different jobs.

[Figure (b): Inefficient allocation of a job (J3). Labels: Collision, Path 1, Path 2; switches SW1 to SW6; jobs J0 to J3.]
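The idea of steering jobs onto different fat-tree paths so that their communications do not collide can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function `allocate_path`, the `link_load` counter, and the two-path topology through SW1 and SW2 are hypothetical, and the least-loaded-bottleneck rule is just one plausible allocation policy.

```python
from collections import defaultdict


def allocate_path(candidate_paths, link_load):
    """Pick the path whose busiest link currently carries the fewest jobs.

    candidate_paths: sequence of paths, each a tuple of (switch, switch) links.
    link_load: mapping from link to the number of jobs already routed over it.
    Hypothetical policy for illustration only.
    """
    def bottleneck(path):
        return max(link_load[link] for link in path)

    # min() keeps the first candidate on ties, so allocation is deterministic.
    best = min(candidate_paths, key=bottleneck)
    for link in best:
        link_load[link] += 1  # record that one more job uses this link
    return best


# Assumed topology: edge switches SW3 and SW4 can reach each other
# through either upper-level switch SW1 (Path 1) or SW2 (Path 2).
path_1 = (("SW3", "SW1"), ("SW1", "SW4"))
path_2 = (("SW3", "SW2"), ("SW2", "SW4"))

link_load = defaultdict(int)
first = allocate_path([path_1, path_2], link_load)
second = allocate_path([path_1, path_2], link_load)
```

Under these assumptions the first job takes Path 1 and the second is steered to Path 2, so no link carries more than one job. A JMS without network awareness has no such load information and may place both jobs on the same path.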