Resource management is a well-known problem in almost every computing system, from embedded devices to High-Performance Computing (HPC), and it is key to optimizing multiple orthogonal system metrics such as power consumption, performance and reliability. To achieve such an optimization, a resource manager must suitably allocate the available system resources (e.g. processing elements, memories and interconnect) to the running applications. This process faces two main problems: a) system resources are usually shared between multiple applications, which induces resource contention; and b) each application requires a different Quality of Service, making it harder for the resource manager to work in an application-agnostic mode. In this scenario, resource management is a critical and essential component of a computing system and should act at different levels to optimize the whole system while keeping it flexible and versatile. In this paper we describe a multi-layer resource management strategy that operates at the application, operating system and hardware levels and aims to optimize resource allocation on embedded, desktop multi-core and HPC systems.
This paper introduces the mig framework: an Open MPI extension that transparently supports the migration of application processes across different nodes of a distributed High-Performance Computing (HPC) system. The framework provides mechanisms on top of which suitable resource managers can implement policies to react to hardware faults, address performance variability, improve resource utilization, and perform fine-grained load balancing and power-thermal management. Compared to other state-of-the-art approaches, the mig framework does not require changes to the application code. Moreover, it is highly maintainable, since it is mainly a self-contained solution that has required very few changes to other existing Open MPI frameworks. Experimental results have shown that the proposed extension does not introduce significant overhead in the application execution, while the penalty due to performing a migration can be properly taken into account by a resource manager.

CCS Concepts: • Software and its engineering → Checkpoint / restart; Multiprocessing / multiprogramming / multitasking; • Computing methodologies → Parallel computing methodologies; • Computer systems organization → Distributed architectures;