While previous work has shown MPI to provide capabilities for system software, actual adoption has not widely occurred. We discuss process management shortcomings in MPI implementations and their impact on MPI usability for system software and management tasks. We introduce MPISH, a parallel shell designed to address these issues.
Cloud resources promise to be an avenue to address new categories of scientific applications, including data-intensive science applications, on-demand/surge computing, and applications that require customized software environments. However, there is a limited understanding of how to operate and use clouds for scientific applications. Magellan, a project funded through the Department of Energy's (DOE) Advanced Scientific Computing Research (ASCR) program, is investigating the use of cloud computing for science at the Argonne Leadership Computing Facility (ALCF) and the National Energy Research Scientific Computing Facility (NERSC). In this paper, we detail the experiences to date at both sites and identify the gaps and open challenges from both a resource provider and an application perspective.
We describe the use of component architecture in an area to which this approach has not been classically applied, the area of cluster system software. By "cluster system software," we mean the collection of programs used in configuring and maintaining individual nodes, together with the software involved in submission, scheduling, monitoring, and termination of jobs. We describe how the component approach maps onto the cluster systems software problem, together with our experiences with the approach in implementing an all-new suite of systems software for a medium-sized cluster with unusually complex systems software requirements.
Systems software for clusters typically derives from a multiplicity of sources: the kernel itself, software associated with a particular distribution, site-specific purchased or open-source software, and assorted home-grown tools and procedures that attempt to glue everything together to meet the needs of the users and administrators of a particular cluster. Whether a cluster is a general-purpose resource serving multiple users or dedicated to a single application, getting everything to work together is a challenge. The challenge is partially met by special software distributions for clusters such as OSCAR or ROCKS. Here we discuss another approach (although it is not inconsistent with existing distributions), in which a small number of concepts are deployed to facilitate the customized integration of various software tools for cluster management, operation, and user jobs. The concepts include (1) a component approach to basic system software such as schedulers, queue managers, process managers, and monitors; (2) a software development kit for constructing networks of system software components, either from scratch or by wrapping "foreign" software; and (3) the use of explicit parallelism in building system tools for high performance. We illustrate this approach with a description of a mid-sized general-purpose cluster operated entirely by software built this way.