High performance computing (HPC) systems use checkpoint-restart to tolerate failures. Typically, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. These high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of a failure. We alleviate this problem with mcrEngine, a scalable checkpoint-restart system that aggregates checkpoints from multiple application processes, using knowledge of data semantics available through widely used I/O libraries such as HDF5 and netCDF, and compresses them. Our novel scheme improves the compressibility of checkpoints by up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints shows that mcrEngine reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression.
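The following is a minimal sketch of the data-aware aggregation idea, not mcrEngine's actual implementation: it assumes per-rank HDF5 checkpoint files with identical dataset layouts, and compares naive byte concatenation against interleaving same-named variables across ranks before compression. The file pattern `ckpt_rank*.h5` and the function names are hypothetical.

```python
# Sketch of semantics-aware checkpoint aggregation vs. naive concatenation.
# Assumes per-rank HDF5 checkpoints exposing the same dataset names.
import glob
import zlib

import h5py
import numpy as np

def naive_concat_compress(paths):
    """Baseline: concatenate raw checkpoint bytes, then compress."""
    blob = b"".join(open(p, "rb").read() for p in paths)
    return zlib.compress(blob)

def semantic_merge_compress(paths):
    """Group same-named datasets across ranks before compressing, so
    similar data (e.g., every rank's 'temperature') sits together."""
    files = [h5py.File(p, "r") for p in paths]
    names = sorted(files[0].keys())  # assumed identical layout per rank
    buf = bytearray()
    for name in names:
        for f in files:  # interleave by variable, not by rank
            buf += np.asarray(f[name]).tobytes()
    for f in files:
        f.close()
    return zlib.compress(bytes(buf))

if __name__ == "__main__":
    ckpts = sorted(glob.glob("ckpt_rank*.h5"))
    print("naive   :", len(naive_concat_compress(ckpts)), "bytes")
    print("semantic:", len(semantic_merge_compress(ckpts)), "bytes")
```

The intuition is that a generic compressor finds more redundancy when semantically similar arrays are adjacent in its input window than when whole per-rank files are simply appended.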
The era of extremely heterogeneous supercomputing brings with it the devil of increased performance variation and reduced reproducibility. There is a lack of understanding in the HPC community of how the simultaneous consideration of network traffic, power limits, concurrency tuning, and interference from other jobs impacts application performance. In this paper, we design a methodology that allows both HPC users and system administrators to understand the trade-off space between optimal and reproducible performance. We present a first-of-its-kind dataset that simultaneously varies multiple system- and user-level parameters on a production cluster, and introduce a new metric, called the desirability score, which enables comparison across different system configurations. We develop a novel, model-agnostic machine learning methodology based on graph signal theory for comparing the influence of parameters on application predictability, and, using a new visualization technique, make practical suggestions for best practices in multi-objective HPC environments.
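The paper's exact desirability-score formula is not reproduced here; the sketch below is an assumed illustrative variant that rewards configurations that are both fast (low median runtime) and reproducible (low run-to-run spread), normalized against the best and worst observed configurations. The weight `alpha` and the example power-cap settings are hypothetical.

```python
# Hypothetical desirability-style score: higher is better, in [0, 1].
# NOT the paper's metric; an assumed blend of speed and reproducibility.
import numpy as np

def desirability(runtimes, alpha=0.5):
    """runtimes: dict mapping configuration name -> array of repeated runs.
    alpha: assumed weight between speed (median) and spread (CoV)."""
    medians = {c: np.median(r) for c, r in runtimes.items()}
    spreads = {c: np.std(r) / np.mean(r) for c, r in runtimes.items()}
    t_lo, t_hi = min(medians.values()), max(medians.values())
    s_lo, s_hi = min(spreads.values()), max(spreads.values())

    def norm(x, lo, hi):  # 1.0 = best observed, 0.0 = worst observed
        return 1.0 if hi == lo else (hi - x) / (hi - lo)

    return {c: alpha * norm(medians[c], t_lo, t_hi)
               + (1 - alpha) * norm(spreads[c], s_lo, s_hi)
            for c in runtimes}

# Example: a fast-but-variable setting vs. a slower-but-stable one.
print(desirability({
    "powercap_115W": np.array([10.1, 10.2, 10.1, 10.3]),
    "powercap_70W":  np.array([12.8, 14.9, 12.1, 15.5]),
}))
```

Any metric of this shape lets a single scalar rank configurations along the optimal-versus-reproducible trade-off the abstract describes.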
CCS Concepts: • Computing methodologies → Parallel computing methodologies; Machine learning; Model development and analysis; • General and reference → Performance.