Abstract: The range of Cloud infrastructure services expands steadily, leaving users in the agony of choice. To select the best mix of service offerings from an abundance of possibilities, users must consider complex dependencies and heterogeneous sets of criteria. Therefore, we present a PhD thesis proposal on investigating an intelligent decision support system for selecting Cloud-based infrastructure services (e.g. storage, network, CPU). The outcomes of this work will be decision support tools and techniques that automate the mapping of users' specified application requirements to Cloud service configurations.
Applications performing ultra-large-scale simulations by solving PDEs require very large computational systems for their timely solution. Studies have shown that the rate of failure grows with system size, and these trends are likely to worsen in future machines as less reliable components are used to reduce the energy cost. Thus, as systems, and the problems solved on them, continue to grow, the ability to survive failures is becoming a critical aspect of algorithm development. The sparse grid combination technique (SGCT) is a cost-effective method for solving time-evolving PDEs, especially for higher-dimensional problems. It can also be easily modified to provide algorithm-based fault tolerance for these problems. In this paper, we show how the SGCT can produce a fault-tolerant version of the GENE gyrokinetic plasma application, which evolves a 5D complex density field over time. We use an alternate component grid combination formula to recover data from lost processes. User Level Failure Mitigation (ULFM) MPI is used to recover the processes, and our implementation is robust over multiple failures and recovery for both process and node failures. An acceptable degree of modification of the application is required. Results using the SGCT on two of the fields' dimensions show competitive execution times with acceptable error (within 0.1%), compared to the same simulation with a single full resolution grid. The benefits improve when the SGCT is used over three dimensions. Our experiments show that the GENE application can successfully recover from multiple process failures, and applying the SGCT the corresponding number of times minimizes the error for the lost sub-grids. Application recovery overhead via ULFM MPI increases from ∼1.5s at 64 cores to ∼5s at 2048 cores for a one-off failure. This compares favourably to using GENE's in-built checkpointing with job restart in conjunction with the classical SGCT on failure, which has overheads four times as large for a single failure, excluding the backtrack overhead. An analysis for a long-running application taking into account checkpoint backtrack times indicates a reduction in overhead of over an order of magnitude.
Ultra-large-scale simulations that solve partial differential equations (PDEs) require very large computational systems for their timely solution. Studies have shown that the rate of failure grows with system size, and these trends are likely to worsen in future machines. Thus, as systems, and the problems solved on them, continue to grow, the ability to survive failures is becoming a critical aspect of algorithm development. The sparse grid combination technique (SGCT), which is a cost-effective method for solving higher-dimensional PDEs, can be easily modified to provide algorithm-based fault tolerance. In this article, we describe how the SGCT can produce fault-tolerant versions of the Gyrokinetic Electromagnetic Numerical Experiment plasma application, the Taxila Lattice Boltzmann Method application, and the Solid Fuel Ignition application. We use an alternate component grid combination formula, which adds some redundancy to the SGCT, to recover data from lost processes. User-level failure mitigation (ULFM) message passing interface (MPI) is used to recover the processes, and our implementation is robust over multiple failures and recovery (processes and nodes). An acceptable degree of modification of the applications is required. Results using the 2-D SGCT show competitive execution times with acceptable error (within 0.1% to 1.0%), compared to the same simulation with a single full resolution grid. The benefits improve when the 3-D SGCT is used. Experiments show the applications' ability to successfully recover from multiple failures, and applying the SGCT multiple times reduces the computed solution error. Process recovery via ULFM MPI increases from approximately 1.5 sec at 64 cores to approximately 5 sec at 2048 cores for a one-off failure. This compares favourably with the applications' built-in checkpointing with job restart in conjunction with the classical SGCT on failure, which has overheads four times as large for a single failure, excluding the recomputation overhead. An analysis for a long-running application considering recomputation times indicates a reduction in overhead of over an order of magnitude.
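For readers unfamiliar with the classical SGCT that both abstracts refer to, the following is a minimal sketch of its standard 2-D combination coefficients. It is not the alternate, fault-tolerant formula used by the authors, which the abstracts do not spell out. The level indexing (component grids (i, j) with i, j >= 1, coefficient +1 where i + j = n + 1 and -1 where i + j = n) is one common textbook convention and is an assumption here, as are the function names.

```python
def combination_coefficients(n):
    """Classical 2-D combination coefficients for level n >= 1.

    Returns a dict {(i, j): c} with c = +1 on the diagonal i + j == n + 1
    and c = -1 on the diagonal i + j == n (i, j >= 1). The indexing
    convention is an assumption, not taken from the abstracts.
    """
    if n < 1:
        raise ValueError("level n must be at least 1")
    coeffs = {}
    for i in range(1, n + 1):      # upper diagonal: n grids, weight +1
        coeffs[(i, n + 1 - i)] = 1
    for i in range(1, n):          # lower diagonal: n - 1 grids, weight -1
        coeffs[(i, n - i)] = -1
    return coeffs


def subspace_multiplicity(coeffs, a, b):
    """Net count of hierarchical subspace (a, b) in the combination.

    A component grid (i, j) contains subspace (a, b) when i >= a and j >= b,
    so the net count is the sum of the coefficients of those grids.
    """
    return sum(c for (i, j), c in coeffs.items() if i >= a and j >= b)


if __name__ == "__main__":
    level = 4  # hypothetical level chosen for illustration
    coeffs = combination_coefficients(level)
    for (i, j), c in sorted(coeffs.items()):
        print(f"component grid ({i}, {j}): coefficient {c:+d}")
```

The coefficients sum to 1, and every hierarchical subspace (a, b) with a + b <= n + 1 is counted exactly once. This inclusion-exclusion property is what lets many coarse, anisotropic component grids stand in for a single full-resolution grid. Fault-tolerant variants change which component grids and coefficients are used when some grids are lost.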