Towards Dynamic Resource Management with MPI Sessions and PMIx

Huber, Dominik; Streubel, Maximilian; Comprés, Isaías; Schulz, Martin; Schreiber, Martin; Pritchard, Howard

doi:10.1145/3555819.3555856

Cited by 17 publications

(7 citation statements)

References 16 publications

“…Monitoring solutions, for example, could be used here to find the right timing for such operations. Applications supporting advanced forms of checkpoint/restart or dynamic load balancing [24], [25] are good examples, being massively parallel and of an iterative nature with prescribed points where resource changes can be accommodated. Also, certain kinds of applications, e.g., those based on task farming or parallel-in-time codes based on the Parareal algorithm [26], can benefit from malleability without significant code changes.…”

Section: A Applications (mentioning)

confidence: 99%

“…II-C), for example, the user-level code must outline the reconfiguration phase and carefully define data-management procedures enabling the remapping of the various operands. Nonetheless, more generic support could, e.g., be devised either inside PETSc [25] that controls the data or task distribution of applications or directly through MPI [24], with a set of extensions for dynamic resources and AMR.…”

Section: A Applications (mentioning)

confidence: 99%

“…3) Dynamic Processes with Process Sets (DPP): A recent approach [24], [59], [60] introduced dynamic MPI Process interface extensions and an implementation based on Open MPI [61]. This approach follows the principles further described in [60] allowing, e.g., adding/removing processes to/from the application at runtime.…”

Section: A Programming Models (mentioning)

confidence: 99%

“…Recently, a prototype was developed, which extends the PRRTE, OpenPMIx, and Open MPI implementations to support a dynamic MPI Sessions interface [24]. These extensions allow applications to request the addition of processes to and removal of processes from the application during runtime.…”

Section: B Process Manager / Runtime Environment (mentioning)

confidence: 99%

“…This requires it to be programmable in an easy way. Here, we envision hiding this complexity in three different ways: 1) Provide malleability support within commonly used parallel software and parallel libraries, which lowers the bar for using malleability (see Huber et al. [24], which uses a dynamic resource extension for p4est) or makes malleability almost transparent to the application developers using these updated software packages. 2) We envision a standardized layer between the application and MPI that provides various functionalities to again lower the bar for utilizing malleability.…”

Section: Guiding the Future Research And Conclusion (mentioning)

confidence: 99%

Malleability in Modern HPC Systems: Current Experiences, Challenges, and Future Opportunities

Tarraf, Schreiber, Cascajo et al. 2024

IEEE Trans. Parallel Distrib. Syst.

With the increase of complex scientific simulations driven by workflows and heterogeneous workload profiles, managing system resources effectively is essential for improving performance and system throughput, especially due to trends like heterogeneous HPC and deeply integrated systems with on-chip accelerators. For optimal resource utilization, dynamic resource allocation can improve productivity across all system and application levels, by adapting the applications' configurations to the system's resources. In this context, malleable jobs, which can change resources at runtime, can increase the system throughput and resource utilization while bringing various advantages for HPC users (e.g., shorter waiting time). Malleability has received much attention recently, even though it has been an active research area for almost two decades [1]. This paper presents the state-of-the-art of malleable implementations in HPC systems, targeting mainly malleability in compute and I/O resources. Based on our experiences, we state our current concerns and list future opportunities for research.
