Cluster computing has attracted growing attention from both industry and academia for its computing power, cost-effectiveness, and scalability. Availability is a key system attribute that must be considered at the system design stage and must also reflect the system's actual behavior. System monitoring and logging make it possible to identify unplanned events and thus to capture the system's actual availability. This paper proposes a single framework that coordinates event monitoring, filtering, data analysis, and dynamic availability modeling. The availability model is abstracted and categorized based on functionality. We describe the proposed architecture and a sample analysis of real-time event logs from a 512-node cluster at Lawrence Livermore National Laboratory.
Purpose: To develop a numerical method for solving hyperbolic two‐step micro heat transport equations, which have attracted attention in thermal analysis of thin metal films exposed to ultrashort‐pulsed lasers. Design/methodology/approach: An energy estimation for the hyperbolic two‐step model in a three‐dimensional (3D) micro sphere irradiated by ultrashort‐pulsed lasers is first derived, and then a finite difference scheme for solving the hyperbolic two‐step model based on the energy estimation is developed. The scheme is shown to be unconditionally stable and satisfies a discrete analogue of the energy estimation. The method is illustrated by investigating the heat transfer in a micro gold sphere exposed to ultrashort‐pulsed lasers. Findings: Provides information on normalized electron temperature change with time on the surface of the sphere, and shows the changes in electron and lattice temperatures. Research limitations/implications: The hyperbolic two‐step model is considered under the assumption of constant thermal properties. Practical implications: A useful tool to investigate the temperature change in a micro sphere irradiated by ultrashort‐pulsed lasers. Originality/value: Provides a new unconditionally stable finite difference scheme for solving the hyperbolic two‐step model in a 3D micro sphere irradiated by ultrashort‐pulsed lasers.
In recent years, large-scale clusters have been commonly deployed to solve important grand challenge scientific problems. To reduce computation time, system sizes have steadily grown. Unfortunately, the reliability of such cluster systems moves in the opposite direction as system scale increases. Since the failure of a single node can result in a system outage, it is essential to deal effectively with faults in grand challenge problem-solving environments. Checkpointing is a common fault tolerance technique. However, it poses many challenges, such as overhead, latency, and consistency, as well as recovery. This paper introduces a reliability-aware checkpoint/restart method, a novel technique that places checkpoints based on system reliability. We construct a cost model and derive an optimal checkpoint placement function based on failure rates; the trade-off between performance and reliability (i.e., performability) is a key consideration. We also implement a proof of concept and demonstrate the improvements resulting from our techniques for fault-tolerant MPI applications on an HA-OSCAR cluster.
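The abstract does not give the paper's cost model or its placement function, so the following is only a minimal sketch of the general idea of tying checkpoint spacing to failure rates. It swaps in Young's classic first-order approximation, interval = sqrt(2 × checkpoint cost / failure rate), and assumes independent nodes with exponentially distributed failures. The function names and the numbers in the usage note are hypothetical.

```python
import math


def system_failure_rate(node_failure_rate, num_nodes):
    """Failure rate of a system that goes down when any single node fails.

    Assumption: nodes fail independently with exponentially distributed
    times to failure, so the per-node rates add up.
    """
    if node_failure_rate <= 0 or num_nodes <= 0:
        raise ValueError("rate and node count must be positive")
    return node_failure_rate * num_nodes


def young_checkpoint_interval(checkpoint_cost, failure_rate):
    """Young's first-order estimate of the compute time between checkpoints.

    checkpoint_cost: time needed to write one checkpoint (e.g. seconds)
    failure_rate:    system failures per unit time (1 / MTBF), same time unit

    This is a stand-in for illustration, not the placement function
    derived in the paper.
    """
    if checkpoint_cost <= 0 or failure_rate <= 0:
        raise ValueError("cost and failure rate must be positive")
    return math.sqrt(2.0 * checkpoint_cost / failure_rate)


if __name__ == "__main__":
    # Hypothetical cluster: 100 nodes, each failing about once per 1e6 seconds.
    rate = system_failure_rate(1e-6, 100)
    print(young_checkpoint_interval(50.0, rate))
```

With these made-up inputs the system MTBF is about 10,000 seconds and the suggested interval is about 1,000 seconds. Growing the node count raises the system failure rate and shortens the interval, which mirrors the scale-versus-reliability tension described above.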
Since the initial introduction of Open Source Cluster Application Resources (OSCAR), this software package has been a well-accepted choice for building high-performance computing systems. As it continues to be applied in mission-critical environments, high availability (HA) features need to be included in OSCAR clusters. In this paper, we provide an HA solution for OSCAR clusters. Component redundancy, a widely used technique in HA solutions, is adopted to improve system availability. Based on the proposed architecture, we develop a detailed failure-repair model for predicting the availability of an HA OSCAR cluster. Stochastic reward nets (SRN) are used to model the behavior of the system. We specify our SRN model in the Stochastic Petri Net Package, and describe and compute several output measures that characterize the availability features of the HA OSCAR cluster.
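As a rough illustration of why component redundancy raises availability, the sketch below uses the textbook steady-state formulas rather than the paper's SRN model. It assumes independent failures and repairs and perfect, instantaneous failover; the function names and numbers are hypothetical.

```python
def steady_state_availability(mttf, mttr):
    """Long-run fraction of time a single repairable component is up.

    mttf: mean time to failure, mttr: mean time to repair (same time unit).
    """
    if mttf <= 0 or mttr < 0:
        raise ValueError("mttf must be positive and mttr non-negative")
    return mttf / (mttf + mttr)


def redundant_availability(component_availability, copies):
    """Availability of a group that is up while at least one copy is up.

    Assumption: copies fail and are repaired independently, and failover
    is perfect. A full failure-repair model would relax both assumptions.
    """
    if not 0.0 <= component_availability <= 1.0 or copies < 1:
        raise ValueError("availability must be in [0, 1] and copies >= 1")
    return 1.0 - (1.0 - component_availability) ** copies


if __name__ == "__main__":
    # Hypothetical head node: fails every 999 hours on average, 1 hour repair.
    single = steady_state_availability(999.0, 1.0)
    print(single, redundant_availability(single, 2))
```

Under these assumptions, duplicating a component with availability near 0.999 pushes the unavailability from roughly one part in a thousand to roughly one part in a million. Real systems fall short of this because failover is imperfect, which is one reason a detailed failure-repair model is useful.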
This paper illustrates the uniformization procedure for the numerical transient solution of large sparse Markov processes. Alternative implementations are discussed, particularly for Markov processes with a small state space. A lightweight solution is presented for solving large sparse systems. The lightweight solution has the merits of clear modularity, platform independence, and ease of deployment on a network for remote computing.
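For readers unfamiliar with uniformization, here is a minimal dense sketch of the standard procedure, not the paper's lightweight implementation. It picks a rate Λ no smaller than the largest exit rate, forms the discrete-time matrix P = I + Q/Λ, and sums Poisson-weighted powers of P applied to the initial distribution. The function name, tolerance, and term limit are assumptions, and a sparse matrix format would replace the nested lists for large state spaces.

```python
import math


def uniformization(Q, p0, t, tol=1e-10, max_terms=100000):
    """Transient state probabilities p(t) of a continuous-time Markov chain.

    Q:  generator matrix as a list of rows (each row sums to zero)
    p0: initial probability vector
    t:  time at which the distribution is wanted

    Note: exp(-rate * t) underflows when rate * t is very large; practical
    implementations scale the Poisson weights. This sketch does not.
    """
    n = len(Q)
    rate = max(-Q[i][i] for i in range(n))  # uniformization rate
    if rate == 0.0 or t == 0.0:
        return list(p0)

    # Discrete-time transition matrix of the uniformized chain.
    P = [[(1.0 if i == j else 0.0) + Q[i][j] / rate for j in range(n)]
         for i in range(n)]

    weight = math.exp(-rate * t)            # Poisson probability of 0 jumps
    v = list(p0)                            # p0 * P^k, starting with k = 0
    result = [weight * x for x in v]
    accumulated = weight

    k = 0
    while 1.0 - accumulated > tol and k < max_terms:
        k += 1
        v = [sum(v[i] * P[i][j] for i in range(n)) for j in range(n)]
        weight *= rate * t / k              # Poisson probability of k jumps
        result = [r + weight * x for r, x in zip(result, v)]
        accumulated += weight
    return result


if __name__ == "__main__":
    # Two-state up/down model: failure rate 0.1, repair rate 1.0.
    Q = [[-0.1, 0.1], [1.0, -1.0]]
    print(uniformization(Q, [1.0, 0.0], 2.0))
```

For the two-state example, the result can be checked against the closed form b/(a+b) + a/(a+b)·exp(-(a+b)t) for the probability of being up, with failure rate a and repair rate b. Because every term in the sum is non-negative, the method avoids the cancellation problems of a naive Taylor series for the matrix exponential.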