This paper characterizes "queue storms" in supercomputer systems and discusses methods for quelling them. Queue storms are anomalously long queues whose occurrence depends on the job size mix, the queuing system, the machine size, and correlations and dependencies between job submissions. We use synthetic data generated from actual job log data from the ASCI Blue Mountain supercomputer, combined with different long-range dependencies. We show the distribution of times until the first storm occurs; in a sense, this is the time at which the machine becomes obsolete, because it is when the machine first fails to provide satisfactory turnaround. To overcome queue storms, more resources are needed even if they appear superfluous most of the time. We present two methods, including a grid-based solution, for reducing these correlations and their resulting effect on the size and frequency of queue storms.
In this paper we introduce the concept of Computation-at-Risk (CaR): a methodology, a procedure, and a measure of the computational risk and reward that result from running a particular portfolio of jobs on a cluster under a specific queue policy. Modeled after Value-at-Risk (VaR) from the financial community, CaR introduces the new element of computational risk into the management of a computational cluster. Specifically, administrators of clusters and other large-scale computing systems must deal with a wide range of job sizes, often spanning up to eight orders of magnitude in the number of cycles. Such a job portfolio carries implicit risks and rewards to performance, both for certain types of jobs and for the facility overall. In this paper we quantify the risk and reward in terms of makespan and expansion factor. We assess the risk/reward profile for two categories of job portfolios, one with respect to queue settings and the other in terms of job sizes. These assessments provide a means for evaluating which queue policies or job sizes have the best risk/reward characteristics in terms of performance. We found that looser constraints on queue policy, in the form of run-time limits, were beneficial from a risk/reward and CaR perspective. This information can be used by administrators to modify queue policy and by users to tailor the size of their jobs.
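As a rough illustration of how a VaR-style figure could be read off expansion factors, the sketch below assumes the common scheduling definition of expansion factor, (wait time + run time) / run time, and treats CaR as a nearest-rank quantile of that metric over a job portfolio. Both choices are assumptions made for illustration; the abstract does not give the paper's exact formulation, and the function names and job data are hypothetical.

```python
import math

def expansion_factor(wait_time, run_time):
    """Assumed definition: total turnaround divided by run time."""
    return (wait_time + run_time) / run_time

def computation_at_risk(values, confidence=0.95):
    """Nearest-rank quantile of a performance metric (VaR-style).

    Hypothetical helper: returns the value that a fraction `confidence`
    of the jobs do not exceed.
    """
    ordered = sorted(values)
    rank = math.ceil(confidence * len(ordered))  # 1-based nearest rank
    return ordered[max(rank, 1) - 1]

# Hypothetical portfolio of (wait_time, run_time) pairs, in seconds.
jobs = [(0, 100), (50, 100), (300, 100), (10, 10), (900, 100)]

factors = [expansion_factor(wait, run) for wait, run in jobs]
car_95 = computation_at_risk(factors, 0.95)
car_50 = computation_at_risk(factors, 0.50)
```

Comparing such quantiles across queue settings or job-size classes is one simple way to rank portfolios by risk; a fuller treatment would also weigh the reward side, for example mean makespan.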