Given the cost of HPC clusters, making the best use of them is crucial to improve infrastructure ROI. Likewise, reducing failed HPC jobs and the related waste in terms of user wait times is crucial to improve HPC user productivity (aka human ROI). While most efforts (e.g., debugging HPC programs) explore technical aspects to improve the ROI of HPC clusters, we hypothesize that non-technical (human) aspects are worth exploring to make non-trivial ROI gains; specifically, understanding non-technical aspects and how they contribute to the failure of HPC jobs. In this regard, we conducted a case study in the context of the Beocat cluster at Kansas State University. The purpose of the study was to learn the reasons why users terminate jobs and to quantify the wasted computation in such jobs in terms of system utilization and user wait time. The data from the case study helped identify interesting and actionable reasons why users terminate HPC jobs. It also helped confirm that user-terminated jobs may be associated with a non-trivial amount of wasted computation, which, if reduced, can help improve the ROI of HPC clusters.
Motivation

Given the cost of creating and operating high-performance computing (HPC) clusters, making the best use of the clusters is crucial for infrastructure ROI; e.g., the creation of a level 3 XSEDE [4] cluster like Beocat at Kansas State University (described in Section 2.2) can easily cost more than 2 million US dollars. Beyond merely keeping processors busy and memory/storage occupied, this is about the usefulness of the computations performed on clusters, i.e., ensuring that the results of computations are not wasted due to being incomplete, incorrect, or irrelevant. This latter goal is often pursued by exploring techniques to reduce HPC job failures stemming from hardware and/or software failures. In particular, there has been considerable interest in the HPC community in improving infrastructure ROI by identifying and tackling hurdles rooted in technical aspects of HPC. For example, in 2017, DOE published a technical report focused on the needs and ways to specify, test/verify, and debug massively parallel programs [10]. There have been empirical studies to characterize and understand job failures by considering 1) various non-human factors such as spatial and temporal dependences between failures, power quality, temperature, and radiation [8], and 2) different statistics such as mean time between failures, mean time to repair [15], and submission inter-arrival time [19]. Ahrens et al. studied the use of HPC for data-intensive science in the US DOE and identified various challenges: support for monitoring the progress of computations and steering them in real time, use of novel and apt data abstractions and representations, and leveraging couplings between experiments [5]. Faulk et al. attempted to define and measure HPC productivity in terms of the science accomplished and the artifacts involved [9].

While improving infrastructure ROI is important, we conjecture the community should also focus on improving human ROI (aka HPC user productivity).