S.D. Kleban scite author profile

This paper characterizes "queue storms" in supercomputer systems and discusses methods for quelling them. Queue storms are anomalously large queue lengths that depend on the job size mix, the queuing system, the machine size, and correlations and dependencies between job submissions. We use synthetic data generated from actual job log data from the ASCI Blue Mountain supercomputer combined with different long-range dependencies. We show the distribution of times to the first storm, which is in a sense the time at which the machine becomes obsolete, because it marks the point when the machine first fails to provide satisfactory turnaround. To overcome queue storms, more resources are needed even if they appear superfluous most of the time. We present two methods, including a grid-based solution, for reducing these correlations and their resulting effect on the size and frequency of queue storms.

Computation-at-risk: assessing job portfolio management risk on clusters

Kleban

Clearwater

In this paper we introduce the concept of Computation-at-Risk (CaR), a methodology, procedure, and quantity for assessing the computational risk and reward that result from running a particular portfolio of jobs on a cluster under a specific queue policy. Modeled after Value-at-Risk (VaR) from the financial community, CaR introduces the new element of computational risk into the management of a computational cluster. Specifically, administrators of clusters and other large-scale computing systems must deal with a wide range of job sizes, often spanning up to eight orders of magnitude in the number of cycles. Such a job portfolio has implicit risks and rewards to performance, both for certain types of jobs and for the facility overall. In this paper we quantify the risk and reward in terms of makespan and expansion factor. We assess the risk/reward profile for two categories of job portfolios, one with respect to queue settings and the other in terms of job sizes. These assessments provide a means for evaluating which queue policies or job sizes have the best risk/reward characteristics in terms of performance. We found that looser constraints on queue policy in the form of run-time limits were beneficial from a risk/reward and CaR perspective. This information can be used by administrators to modify queue policy and by users to tailor the size of their jobs.
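The abstract does not give a formula for CaR, so the following is only a rough sketch of the Value-at-Risk analogy, not the authors' method. It assumes the common definition of expansion factor as (wait time + run time) / run time, and it treats the risk quantity as a nearest-rank quantile of the observed expansion factors. The function names, the confidence level, and the job data are all hypothetical.

```python
import math

def expansion_factor(wait_time, run_time):
    """Assumed definition: (wait + run) / run. A value of 1.0 means no waiting."""
    if run_time <= 0:
        raise ValueError("run_time must be positive")
    return (wait_time + run_time) / run_time

def quantile_at_risk(values, confidence=0.95):
    """Nearest-rank quantile: the smallest observed value v such that at
    least `confidence` of the observations are <= v (a VaR-style threshold)."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < confidence <= 1:
        raise ValueError("confidence must be in (0, 1]")
    ordered = sorted(values)
    rank = math.ceil(confidence * len(ordered))
    return ordered[rank - 1]

# Hypothetical job portfolio: (wait_time, run_time) pairs in seconds.
jobs = [(0, 100), (50, 100), (300, 100), (10, 10), (900, 100)]
factors = [expansion_factor(w, r) for w, r in jobs]

# Expansion factor that 95% of jobs in this sample do not exceed.
car_95 = quantile_at_risk(factors, 0.95)
print(car_95)
```

Comparing such a quantile across queue policies or job-size classes is one way to read the risk/reward comparison the abstract describes; the paper itself may define the quantity differently.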

S.D. Kleban

Hierarchical Dynamics, Interarrival Times, and Performance

Fair share on high performance computing systems: what does fair really mean?

IDSim: an extensible framework for Interoperable Distributed Simulation

Quelling queue storms

Computation-at-risk: assessing job portfolio management risk on clusters
