Alexandre Santana scite author profile

Castro

et al. 2018

Effectively mapping tasks of High Performance Computing (HPC) applications on parallel systems is crucial to assure substantial performance gains. As platforms and applications grow, load imbalance becomes a priority issue. Even though centralized rescheduling has been a viable solution to mitigate this problem, its efficiency is not able to keep up with the increasing size of shared memory platforms. To efficiently solve load imbalance today, and in the years to come, we should prioritize decentralized strategies developed for large-scale platforms. In this paper, we propose our Batch Task Migration approach to improve decentralized global rescheduling, ultimately reducing communication costs and preserving task locality. We implemented and evaluated our approach on two different parallel platforms, using both synthetic workloads and a molecular dynamics (MD) benchmark. Our solution was able to achieve speedups of up to 3.75 and 1.15 on rescheduling time, when compared to other centralized and distributed approaches, respectively. Moreover, it improved the execution time of MD by factors of up to 1.34 and 1.22 when compared to a scenario without load balancing on two different platforms.

ARTful: A model for user‐defined schedulers targeting multiple high‐performance computing runtime systems

et al. 2021

Global schedulers are components in parallel runtime libraries that distribute the application's workload across physical resources. More often than not, applications showcase dynamic load imbalance and require customized scheduling solutions to avoid wasting resources. Some libraries lack support for user‐defined schedulers, and developers resort to unofficial extensions that are harder to reuse and maintain. We propose a global scheduler software design, entitled the ARTful model, to create user‐defined solutions with minimal alterations in the runtime library. Our model uses a component‐based design to separate components from the runtime library and the scheduling policy implementation. The ARTful model describes the interface of a portable scheduler library, allowing policies to operate on different runtime libraries. We study the overhead induced by our design through our ARTful library implementation, a metaprogramming‐oriented global scheduling library, using workload‐aware scheduling policies. We experiment with two different policies from the OpenMP and Charm++ runtime systems, also presenting evaluations of the policies outside of their original library context. We observe that our portable schedulers can sometimes perform decisions faster than their native counterparts, with negligible overhead in the execution times of synthetic applications and molecular dynamics kernels.
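The separation between a scheduling policy and the runtime library described in this abstract can be illustrated with a minimal sketch. It is not the ARTful interface: the names `WorkloadSource`, `ListAdapter`, and `greedy_policy` are hypothetical, and the greedy heaviest-task-first policy is an assumption chosen only because it is short. The point is that the policy reads workload data through a small interface, so the same policy could run on any runtime for which an adapter exists.

```python
import heapq
from typing import List, Protocol, Sequence


class WorkloadSource(Protocol):
    """Hypothetical interface a runtime adapter would expose to a policy."""

    def task_loads(self) -> Sequence[float]: ...

    def num_pus(self) -> int: ...


class ListAdapter:
    """Stand-in adapter that serves loads from a plain list (illustration only)."""

    def __init__(self, loads: Sequence[float], pus: int) -> None:
        self._loads = list(loads)
        self._pus = pus

    def task_loads(self) -> Sequence[float]:
        return self._loads

    def num_pus(self) -> int:
        return self._pus


def greedy_policy(src: WorkloadSource) -> List[int]:
    """Assign each task, heaviest first, to the currently least-loaded PU."""
    loads = src.task_loads()
    # Min-heap of (accumulated load, PU id); ties resolve to the lowest PU id.
    heap = [(0.0, pu) for pu in range(src.num_pus())]
    heapq.heapify(heap)
    mapping = [0] * len(loads)
    for task in sorted(range(len(loads)), key=lambda t: -loads[t]):
        pu_load, pu = heapq.heappop(heap)
        mapping[task] = pu
        heapq.heappush(heap, (pu_load + loads[task], pu))
    return mapping


if __name__ == "__main__":
    print(greedy_policy(ListAdapter([5.0, 3.0, 2.0, 2.0], pus=2)))
```

In this sketch, porting the policy to another runtime would only require writing a new adapter; `greedy_policy` itself stays unchanged.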

Reducing Global Schedulers Complexity through Runtime System Decoupling

Pilla

et al. 2018

Global schedulers are components used in parallel solutions, especially in dynamic applications, to optimize resource usage. Nonetheless, their development is a cumbersome process due to the adaptations needed to cope with the programming interfaces and abstractions of runtime systems. This paper proposes a model to dissociate schedulers from runtime systems to lower software complexity. Our model is based on breaking the scheduler down into modular and reusable concepts that better express the scheduler requirements. Through the use of meta-programming and design patterns, we were able to achieve fully reusable workload-aware scheduling strategies with up to 63% fewer lines of code and negligible run time overhead.

PackStealLB: A scalable distributed load balancer based on work stealing and workload discretization

Pilla

Journal of Parallel and Distributed Computing

et al. 2021

The scalability of high-performance, parallel iterative applications is directly affected by how well they use the available computing resources. These applications are subject to load imbalance due to the nature and dynamics of their computations. It is common that high performance systems employ periodic load balancing to tackle this issue. Dynamic load balancing algorithms redistribute the application's workload using heuristics to circumvent the NP-hard complexity of the problem. However, scheduling heuristics must be fast to avoid hindering application performance when distributing the workload on large and distributed environments. In this work, we present a technique for low overhead, high quality scheduling decisions for parallel iterative applications. The technique relies on combined application workload information paired with distributed scheduling algorithms. An initial distributed step among scheduling agents groups application tasks into packs of similar load to minimize messages among them. This information is used by our scheduling algorithm, PackStealLB, for its distributed-memory work stealing heuristic. Experimental results showed that PackStealLB is able to improve the performance of a molecular dynamics benchmark by up to 41%, outperforming other scheduling algorithms in most scenarios over almost one thousand cores.
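The idea of grouping tasks into packs of similar load can be sketched as follows. This is not the PackStealLB algorithm itself: the function name `make_packs`, the `pack_load` target, and the greedy fill rule are assumptions made for illustration. The sketch only shows why packs help: an agent can hand over one pack per message instead of one task per message.

```python
from typing import Dict, List


def make_packs(task_loads: Dict[str, float], pack_load: float) -> List[List[str]]:
    """Greedily group tasks into packs whose total load stays near pack_load."""
    packs: List[List[str]] = []
    current: List[str] = []
    current_load = 0.0
    # Visit heavier tasks first; ties are broken by name so the result is deterministic.
    for task, load in sorted(task_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        # Close the current pack when adding this task would exceed the target.
        if current and current_load + load > pack_load:
            packs.append(current)
            current, current_load = [], 0.0
        current.append(task)
        current_load += load
    if current:
        packs.append(current)
    return packs


if __name__ == "__main__":
    loads = {"a": 4.0, "b": 3.0, "c": 2.0, "d": 1.0}
    print(make_packs(loads, pack_load=5.0))
```

A task heavier than `pack_load` ends up alone in its own pack in this sketch; a real balancer would need its own rule for that case.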

Distributed Memory Graph Representation for Load Balancing Data: Accelerating Data Structure Generation for Decentralized Scheduling

Castro

et al. 2019

In this paper, we propose a Distributed Graph Model (DGM) and data structure to enable communication-aware heuristics in distributed load balancers (LBs). DGM is motivated by the desire to maintain and use information related to the affinity between tasks (their communication) in order to improve data locality while scheduling tasks in a distributed fashion to avoid the centralization overhead. Results show that DGM is able to achieve speedups of up to 50.4x with 40 virtual cores, when compared to a centralized graph representation with the same purpose. Additionally, we propose a proof-of-concept distributed scheduler that uses DGM, named Edge Migration, and its implementation in the Charm++ parallel programming model. These results show that, although the communication analysis is much faster with DGM, it is still the most relevant overhead in distributed LBs. We also observe that Edge Migration has a decision time in the same order of magnitude as other communication-unaware decentralized algorithms. Thus, DGM can be used in communication-aware distributed LBs to improve load balancing decisions with a small impact on the overall LB performance.