To improve their performance, scientific applications often employ loop scheduling algorithms for load balancing data-parallel computations. Over the years, a number of dynamic loop scheduling (DLS) techniques have been developed. These techniques are based on probabilistic analyses and are effective in addressing unpredictable load imbalances arising from various sources, such as variations in application, algorithmic, and systemic characteristics. Modern high-end computing facilities now offer petascale performance (10^15 FLOPS), and several initiatives aim to achieve exascale performance (10^18 FLOPS) towards the end of the current decade. Efficient and scalable algorithms are therefore required to fully utilize petascale and exascale resources. In this paper, a study of the scalability of DLS techniques via discrete event simulation is presented, in terms of both the number of processors and the problem size. To facilitate this study, a dynamic loop scheduler was designed and implemented using the SimGrid [1] simulation framework. The results demonstrate the scalability of the DLS techniques and their effectiveness in addressing load imbalance in large-scale computing systems.
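To make the scheduling idea above concrete, the following is a minimal sketch of one well-known DLS-family technique, guided self-scheduling (GSS), in which each requesting processor receives a chunk equal to the remaining iterations divided by the number of processors. This is an illustrative example of the class of techniques the abstract refers to, not the scheduler implemented in the paper.

```python
import math

def guided_self_scheduling(total_iters, num_procs):
    """Yield successive chunk sizes under guided self-scheduling (GSS):
    each chunk is ceil(R / p), where R is the number of remaining
    iterations and p the number of processors. Chunks shrink over time,
    so early load imbalance can be smoothed out by the smaller late chunks."""
    remaining = total_iters
    while remaining > 0:
        chunk = math.ceil(remaining / num_procs)
        yield chunk
        remaining -= chunk

# Example: 100 loop iterations scheduled across 4 processors.
chunks = list(guided_self_scheduling(100, 4))
```

The decreasing chunk sequence (25, 19, 14, ... for this example) is what gives such self-scheduling schemes their dynamic load-balancing behavior: large chunks keep scheduling overhead low early on, while small final chunks even out the finish times.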
The execution of computationally intensive parallel applications in heterogeneous environments, where the quality and quantity of computing resources available to a single user change continuously, often leads to irregular behavior, generally due to variations of an algorithmic and systemic nature. To improve the performance of scientific applications, loop scheduling algorithms are often employed for load balancing of their parallel loops. However, it is a challenge to select the most robust scheduling algorithm for guaranteeing optimized performance of scientific applications on large-scale computing systems, whose resources are widely distributed, highly heterogeneous, often shared among multiple users, and whose computing availabilities cannot always be guaranteed or predicted. To address this challenge, in this work we focus on a portfolio-based approach that enables the dynamic selection and use of the most robust dynamic loop scheduling (DLS) algorithm from a portfolio of DLS algorithms, depending on the given application and the current system characteristics, including workload conditions. Thus, in this paper we provide a solution to the algorithm selection problem and experimentally evaluate its quality. We propose the use of supervised machine learning techniques to build empirical robustness prediction models, which are used to predict a DLS algorithm's robustness for given scientific application characteristics and system availabilities. Using simulated application characteristics and system availabilities, along with the empirical robustness prediction models, we show that the proposed portfolio-based approach enables the selection of the most robust DLS algorithm that satisfies a user-specified tolerance on the application's performance in a particular computing system with variable availability.
We also show that the portfolio-based approach offers stronger guarantees of robust application performance with the automatically selected DLS algorithms than with a manually selected DLS algorithm.
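The selection step described in this abstract can be sketched as follows. This is a hypothetical illustration, assuming a trained robustness prediction model is available behind a `predict(alg, features)` callable and that lower predicted scores mean more robust; the algorithm names (STATIC, FSC, GSS, FAC) are standard loop scheduling techniques used here only as placeholders for the actual portfolio.

```python
def select_most_robust(portfolio, features, predict, tolerance):
    """From a portfolio of DLS algorithms, return the one whose predicted
    robustness score satisfies the user-specified tolerance and is best
    (lowest). `predict(alg, features)` stands in for an empirical
    robustness prediction model trained via supervised learning."""
    best_alg, best_score = None, float("inf")
    for alg in portfolio:
        score = predict(alg, features)
        if score <= tolerance and score < best_score:
            best_alg, best_score = alg, score
    return best_alg, best_score

# Toy stand-in for a trained model: fixed predicted scores per algorithm.
predicted = {"STATIC": 1.8, "FSC": 1.3, "GSS": 1.2, "FAC": 1.15}

def toy_predict(alg, features):
    return predicted[alg]

choice, score = select_most_robust(
    list(predicted), features={}, predict=toy_predict, tolerance=1.5
)
```

In this toy run, STATIC is filtered out by the tolerance and FAC is chosen as the most robust remaining candidate; in the paper's setting the scores would instead come from models trained on simulated application and system characteristics.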
Scientific applications running on heterogeneous computing systems, which often exhibit unpredictable behavior, enhance their performance by employing loop scheduling techniques to avoid load imbalance through an optimized assignment of their parallel loops. With current computing platforms delivering petascale performance and promising exascale performance towards the end of the present decade, efficient and robust algorithms are required to guarantee optimal performance of parallel applications in the presence of unpredictable perturbations. A number of dynamic loop scheduling (DLS) methods based on probabilistic analyses have been developed to achieve the desired robust performance. In earlier work, two metrics, flexibility and resilience, were formulated to quantify the robustness of various DLS methods in heterogeneous computing systems with uncertainties. In this work, to ensure robust performance of scientific applications on current (petascale) and future (exascale) high performance computing systems, a simulation model was designed and integrated into the SimGrid simulation toolkit, enabling a comprehensive study of the robustness of the DLS methods based on experimental cases with various combinations of numbers of processors, problem sizes, and scheduling methods. The DLS methods were implemented in the simulation model and analyzed to explore their flexibility (robustness against unpredictable variations in the system load) across a range of scenarios comprising various distributions of loop iteration execution times and system availability. The reported simulation results are used to compare the robustness of the DLS methods under the environments considered, using the flexibility metric.
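As a rough illustration of how a flexibility-style check might be applied to such simulation results, the sketch below deems a DLS method flexible for a scenario when its makespan under a load perturbation stays within a tolerance factor of its expected (unperturbed) makespan. The exact formulation of the flexibility metric is given in the cited earlier work; the tolerance-factor form used here is an assumption for illustration only.

```python
def is_flexible(perturbed_makespan, expected_makespan, tau):
    """Assumed flexibility check: a DLS method tolerates a system-load
    perturbation if the perturbed makespan does not exceed tau times the
    expected makespan (tau > 1 is a user-chosen tolerance factor)."""
    return perturbed_makespan <= tau * expected_makespan

def flexibility_ratio(perturbed_makespans, expected_makespan, tau):
    """Fraction of simulated perturbation scenarios in which the method
    stays within tolerance; higher means more flexible."""
    hits = sum(
        1 for m in perturbed_makespans
        if is_flexible(m, expected_makespan, tau)
    )
    return hits / len(perturbed_makespans)

# Toy data: makespans from four simulated perturbation scenarios.
ratio = flexibility_ratio([10.5, 11.0, 13.0, 11.8], expected_makespan=10.0, tau=1.2)
```

Comparing such ratios across DLS methods, processor counts, and problem sizes mirrors the kind of robustness comparison the abstract describes, albeit with this simplified stand-in metric.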