Since the complexity of virtual experiments (VEs) and their underlying models is constantly increasing, the computational performance of monolithic software solutions is rapidly becoming insufficient. Examples of VEs are probabilistic design, model calibration, optimal experimental design and scenario analysis. To tackle this computational bottleneck, a framework for the distributed execution of VEs on a potentially heterogeneous pool of work nodes has been implemented. This framework, named WDVE (WEST distributed virtual experimentation), is built on top of technologies such as C++, XML and SOAP, and was designed for stability, expandability, performance, platform independence and ease of use. Complex VEs are most often composed of mutually independent sub-experiments, which can be run concurrently. With WDVE, a complex VE executed on a so-called Master machine will therefore attempt to execute its sub-experiments on Slave machines that have previously registered with the Master. The submission of requests for the execution of sub-experiments is transparent and involves the transfer of a description of the experiment to be executed, along with the resources needed for the execution (i.e., the model and input data). WDVE is in many ways similar to the Grid Computing paradigm, which is currently receiving widespread attention; however, WDVE is geared more towards applications within the scope of water quality management.
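To make the Master/Slave interaction concrete, the following minimal C++ sketch mimics a Master farming mutually independent sub-experiments out to previously registered Slaves. All names (SubExperiment, SlaveNode) are hypothetical illustrations, not the actual WDVE API; the XML experiment descriptions and SOAP transport that WDVE uses are elided here.

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <thread>
#include <vector>

// Hypothetical illustration of the Master/Slave pattern described above.
struct SubExperiment {
    std::string description;  // experiment description (XML in WDVE)
    std::string resources;    // model and input data needed for execution
};

struct SlaveNode {
    std::string host;
    // A registered Slave executes one sub-experiment at a time.
    void execute(const SubExperiment& e) const {
        std::cout << host << " running: " << e.description << '\n';
    }
};

int main() {
    // Slaves that have previously registered with the Master.
    std::vector<SlaveNode> slaves = {{"node-a"}, {"node-b"}};

    // Mutually independent sub-experiments of a complex VE.
    std::vector<SubExperiment> subs = {
        {"calibration shot 1", "model+data"},
        {"calibration shot 2", "model+data"},
    };

    // The Master dispatches sub-experiments concurrently, one per Slave.
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < subs.size(); ++i)
        workers.emplace_back([&, i] { slaves[i % slaves.size()].execute(subs[i]); });
    for (auto& w : workers) w.join();
}
```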
The effectiveness of distributed execution of computationally intensive applications (jobs) largely depends on the quality of the applied scheduling approach. However, most existing non-trivial scheduling algorithms rely on prior knowledge or on prediction of application parameters, such as execution time, size of input and output, dependencies, etc., to assign applications to the available computational resources. A major issue is that these parameters are hard to determine in advance, especially if the end user does not possess an extensive history of previous application runs. In this work we propose an online method for execution time prediction of applications whose execution progress can be collected at run-time. Using dynamic progress information, the total job execution time can be predicted by extrapolation. However, the predictions obtained by extrapolation are far from precise and often vary over time as a result of changing application dynamics and varying resource load. Therefore, to estimate the actual job execution time, we match a number of predefined prediction evolution models against the consecutive extrapolations by means of nonlinear curve-fitting. The "best-fit" coefficients allow for more accurate execution time prediction. The predictions made are used to enhance a dynamic scheduling algorithm for workflows introduced in our earlier work. The scheduling algorithm is run with and without curve-fitting, showing a performance improvement of up to 15% in the former case.
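A minimal sketch of the two-stage idea follows, under stated assumptions: the raw predictions come from the elapsed/progress extrapolation, and a single assumed evolution model y(t) = c + a·exp(−b·t) stands in for the paper's set of predefined models. The grid search over b is a crude substitute for a proper nonlinear fitter; the fitted asymptote c serves as the refined execution-time prediction.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

struct Fit { double c, a, b, sse; };

// For a fixed decay rate b, (c, a) enter linearly, so the least-squares
// normal equations have a closed form; b is scanned on a grid.
Fit fitEvolution(const std::vector<double>& t, const std::vector<double>& y) {
    Fit best{0, 0, 0, 1e300};
    for (double b = 0.01; b <= 2.0; b += 0.01) {
        double s1 = 0, sx = 0, sxx = 0, sy = 0, sxy = 0;
        for (std::size_t i = 0; i < t.size(); ++i) {
            double x = std::exp(-b * t[i]);
            s1 += 1; sx += x; sxx += x * x; sy += y[i]; sxy += x * y[i];
        }
        double det = s1 * sxx - sx * sx;
        if (std::fabs(det) < 1e-12) continue;
        double c = (sy * sxx - sx * sxy) / det;
        double a = (s1 * sxy - sx * sy) / det;
        double sse = 0;
        for (std::size_t i = 0; i < t.size(); ++i) {
            double r = y[i] - (c + a * std::exp(-b * t[i]));
            sse += r * r;
        }
        if (sse < best.sse) best = {c, a, b, sse};
    }
    return best;
}

int main() {
    std::vector<double> times, raw;  // elapsed times, raw extrapolations
    for (double t = 1; t <= 10; ++t) {
        // Synthetic progress trace: early extrapolations overshoot, then settle.
        double progress = t / (20.0 + 10.0 * std::exp(-0.3 * t));
        times.push_back(t);
        raw.push_back(t / progress);  // stage 1: total ≈ elapsed / progress
    }
    // Stage 2: curve-fit the evolution of the consecutive extrapolations.
    Fit f = fitEvolution(times, raw);
    std::printf("refined prediction (asymptote c): %.2f\n", f.c);
}
```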
As grids typically consist of autonomously managed subsystems with strongly varying resources, fault-tolerance forms an important aspect of the scheduling process of applications. Two well-known techniques for providing fault-tolerance in grids are periodic task checkpointing and replication. Both techniques mitigate the amount of work lost due to changing system availability, but can introduce significant runtime overhead. The latter largely depends on the length of the checkpointing interval and the chosen number of replicas, respectively. This paper presents a dynamic scheduling algorithm that switches between periodic checkpointing and replication to exploit the advantages of both techniques and to reduce the overhead. Furthermore, several novel heuristics are discussed that perform on-line adaptive tuning of the checkpointing period based on historical information on resource behavior. Simulation-based comparison of the proposed combined algorithm against traditional strategies based on checkpointing and replication only suggests a significant reduction of the average task makespan for systems with varying load.
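The switching idea can be illustrated with a toy decision rule; this rule is an assumption for exposition, not the paper's algorithm or its adaptive heuristics. It estimates the expected overhead fraction of periodic checkpointing via Young's first-order approximation for the optimal interval and compares it with the fixed extra resource cost of running replicas.

```cpp
#include <cmath>
#include <cstdio>

// Young's approximation for the checkpointing interval that minimizes
// expected lost work: T_opt = sqrt(2 * C * MTBF).
double optimalInterval(double checkpointCost, double mtbf) {
    return std::sqrt(2.0 * checkpointCost * mtbf);
}

// Expected overhead fraction of periodic checkpointing: time spent
// writing checkpoints plus expected recomputation after a failure.
double checkpointOverhead(double checkpointCost, double mtbf) {
    double T = optimalInterval(checkpointCost, mtbf);
    return checkpointCost / T + T / (2.0 * mtbf);
}

int main() {
    double checkpointCost = 30.0;  // seconds per checkpoint (toy value)
    double replicas = 2.0;         // two copies: 100% extra resource usage

    double mtbfs[] = {40.0, 600.0, 86400.0};  // toy mean-time-between-failures
    for (double mtbf : mtbfs) {
        double cp = checkpointOverhead(checkpointCost, mtbf);
        double rep = replicas - 1.0;  // extra work fraction of replication
        std::printf("MTBF %7.0fs: checkpointing %3.0f%%, replication %3.0f%% -> %s\n",
                    mtbf, 100 * cp, 100 * rep,
                    cp < rep ? "checkpoint" : "replicate");
    }
}
```

With these toy numbers, replication is preferred only when failures are so frequent that checkpoint traffic plus recomputation would exceed the cost of running a second copy; the paper's algorithm makes this trade-off dynamically per task.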
Adaptive checkpointing is a relatively new approach that is particularly suitable for providing fault-tolerance in dynamic and unstable grid environments. The approach allows for periodic modification of checkpointing intervals at run-time, when additional information becomes available. In this paper an adaptive algorithm, named MeanFailureCP+, is introduced that deals with checkpointing of grid applications whose execution times are unknown a priori. The algorithm modifies its parameters based on dynamically collected feedback on its performance. Simulation results show that the new algorithm performs even better than adaptive approaches that make use of exact information on job execution times.
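The feedback-driven re-tuning can be sketched as follows; the multiplicative update rule and its thresholds are hypothetical stand-ins for exposition, not the actual MeanFailureCP+ parameter-update scheme.

```cpp
#include <cstdio>

// Assumed feedback loop in the spirit of adaptive checkpointing: the
// interval is re-tuned at run-time from observed costs, with no prior
// knowledge of the job's total execution time.
int main() {
    double interval = 60.0;  // initial checkpointing interval (s), assumed

    // Feedback samples per epoch: time spent writing checkpoints, and
    // work lost to failures and recomputed (both in seconds).
    struct Feedback { double overhead, lostWork; };
    Feedback epochs[] = {{90, 10}, {80, 15}, {70, 60}, {40, 150}, {50, 120}};

    for (Feedback f : epochs) {
        if (f.overhead > 2.0 * f.lostWork)
            interval *= 1.5;  // checkpoints dominate cost: checkpoint less often
        else if (f.lostWork > 2.0 * f.overhead)
            interval /= 1.5;  // failures dominate cost: checkpoint more often
        std::printf("overhead=%3.0fs lost=%3.0fs -> interval %.1fs\n",
                    f.overhead, f.lostWork, interval);
    }
}
```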