Loadbalancing on Parallel Heterogeneous Architectures: Spin-image Algorithm on CPU and MIC

Eleliemy, Ahmed; Fayze, Mahmoud; Mehmood, Rashid; Katib, Iyad; Aljohani, Naif Radi

doi:10.3384/ecp17142673

Cited by 10 publications

(25 citation statements)

References 11 publications


“…This work studies load imbalance at the thread level and the process level in 3 scientific applications; PSIA [14], Mandelbrot [13], and SPHYNX [15]. The eLaPeSD is used to balance the load at the thread level and the DLS4LB [2] to support the AWF DLS technique (that supports time-stepping applications, such as SPHYNX in this work), to balance the load at the process level.…”

Section: Related Work (mentioning)

confidence: 99%

“…Line 9 represents the main source of load imbalance, as the number of repetitions of the calculations between Lines 9 to 14 is irregular. The second application of interest is an application from the computer vision domain, namely the parallel spin-image algorithm (PSIA) [14]. PSIA converts a 3D object into a set of 2D descriptors (spin-images).…”

Section: Applications (mentioning)

confidence: 99%
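The conversion mentioned in the statement above, from a 3D object to 2D spin-images, can be illustrated with a minimal sketch. It is not the PSIA implementation: the function name `spin_image`, its parameters, and the binning convention are assumptions made here for illustration, and the normal `n` is assumed to have unit length. Each surrounding point is projected into two cylindrical coordinates relative to an oriented point (radial distance `alpha` from the normal line, signed height `beta` along the normal) and counted in a 2D histogram.

```python
import math

def spin_image(points, p, n, bin_size, image_width):
    """Accumulate a 2D spin-image histogram for the oriented point (p, n).

    Hypothetical helper: names and binning convention are illustrative
    assumptions, and n is assumed to be a unit-length normal.
    """
    image = [[0] * image_width for _ in range(image_width)]
    half_height = image_width * bin_size / 2.0
    for x in points:
        d = [x[k] - p[k] for k in range(3)]
        # Signed height of x along the normal through p.
        beta = sum(n[k] * d[k] for k in range(3))
        # Radial distance of x from the line through p along n.
        dist2 = sum(c * c for c in d)
        alpha = math.sqrt(max(dist2 - beta * beta, 0.0))
        row = math.floor((half_height - beta) / bin_size)
        col = math.floor(alpha / bin_size)
        # Points that fall outside the image support are ignored.
        if 0 <= row < image_width and 0 <= col < image_width:
            image[row][col] += 1
    return image

if __name__ == "__main__":
    cloud = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0, 0.0)]
    for row in spin_image(cloud, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0), 1.0, 4):
        print(row)
```

Generating one such image per oriented point gives a set of independent tasks, which is presumably why the citing works treat the algorithm as a parallel loop to be scheduled.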


Two-level Dynamic Load Balancing for High Performance Scientific Applications

Mohammed, Cavelan, Ciorba et al. 2020

Proceedings of the 2020 SIAM Conference on Parallel Processing for Scientific Computing


Scientific applications are often complex, irregular, and computationally-intensive. To accommodate the ever-increasing computational demands of scientific applications, high performance computing (HPC) systems have become larger and more complex, offering parallelism at multiple levels (e.g., nodes, cores per node, threads per core). Scientific applications need to exploit all the available multilevel hardware parallelism to harness the available computational power. The performance of applications executing on such HPC systems may adversely be affected by load imbalance at multiple levels, caused by problem, algorithmic, and systemic characteristics. Nevertheless, most existing load balancing methods do not simultaneously address load imbalance at multiple levels. This work investigates the impact of load imbalance on the performance of three scientific applications at the thread and process levels. We jointly apply and evaluate selected dynamic loop self-scheduling (DLS) techniques to both levels. Specifically, we employ the extended LaPeSD OpenMP runtime library [1] at the thread level, and extend the DLS4LB MPI-based dynamic load balancing library [2] at the process level. This approach is generic and applicable to any multiprocess-multithreaded computationally-intensive application (programmed using MPI and OpenMP). We conduct an exhaustive set of experiments to assess and compare six DLS techniques at the thread level and eleven at the process level. The results show that improved application performance, by up to 21%, can only be achieved by jointly addressing load imbalance at the two levels. We offer insights into the performance of the selected DLS techniques and discuss the interplay of load balancing at the thread level and process level.


“…Selected Applications: Two computationally-intensive parallel applications are considered in this study. The first application, called PSIA [27], uses a parallel version of the well-known spin-image algorithm (SIA) [28]. SIA converts a 3D object into a set of 2D images.…”

Section: Design and Setup of Experiments (mentioning)

confidence: 99%

Dynamic Loop Scheduling Using MPI Passive-Target Remote Memory Access

Eleliemy, Ciorba 2019

2019 27th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)

Self Cite


Scientific applications often contain large computationally-intensive parallel loops. Loop scheduling techniques aim to achieve load balanced executions of such applications. For distributed-memory systems, existing dynamic loop scheduling (DLS) libraries are typically MPI-based, and employ a master-worker execution model to assign variably-sized chunks of loop iterations. The master-worker execution model may adversely impact performance due to the master-level contention. This work proposes a distributed chunk-calculation approach that does not require the master-worker execution scheme. Moreover, it considers the novel features in the latest MPI standards, such as passive-target remote memory access, shared-memory window creation, and atomic read-modify-write operations. To evaluate the proposed approach, five well-known DLS techniques, two applications, and two heterogeneous hardware setups have been considered. The DLS techniques implemented using the proposed approach outperformed their counterparts implemented using the traditional master-worker execution model.
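The idea of workers computing their own chunks, rather than asking a master for them, can be sketched in a few lines. This is not the paper's MPI implementation: here Python threads stand in for MPI processes, a lock stands in for the atomic read-modify-write on a shared window, and the class `SharedLoopState` and function `run` are hypothetical names. The chunk rule used is guided self-scheduling, where each chunk is the remaining iterations divided by the number of workers, rounded up.

```python
import math
import threading

class SharedLoopState:
    """Shared loop counter (hypothetical); the lock mimics an atomic update."""

    def __init__(self, total_iterations, workers):
        self.total = total_iterations
        self.workers = workers
        self.next_start = 0
        self._lock = threading.Lock()

    def claim_chunk(self):
        """Return (start, size) of the next chunk, or None when the loop is done."""
        with self._lock:
            remaining = self.total - self.next_start
            if remaining <= 0:
                return None
            # Guided self-scheduling: chunks shrink as the loop drains.
            size = math.ceil(remaining / self.workers)
            start = self.next_start
            self.next_start += size
            return start, size

def run(total_iterations, workers):
    """Let every worker claim chunks on its own; return the chunks sorted by start."""
    state = SharedLoopState(total_iterations, workers)
    claimed = []

    def worker():
        while (chunk := state.claim_chunk()) is not None:
            claimed.append(chunk)  # the loop body for this chunk would run here

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(claimed)

if __name__ == "__main__":
    for start, size in run(100, 4):
        print(start, size)
```

No thread plays the master role: whichever worker becomes idle first claims the next chunk, so the only shared point of contention is the short critical section around the counter.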


“…Selected Applications: Two scientific applications are used to assess and compare the performance of the proposed MPI+MPI hierarchical DLS approach: PSIA [32,33] and Mandelbrot [34]. PSIA is a parallel version of the spin-image algorithm (SIA), which converts a 3D object into a set of 2D images [35].…”

Section: Methods (mentioning)

confidence: 99%

Hierarchical Dynamic Loop Self-Scheduling on Distributed-Memory Systems Using an MPI+MPI Approach

Eleliemy, Ciorba 2019

2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

Self Cite


Computationally-intensive loops are the primary source of parallelism in scientific applications. Such loops are often irregular and a balanced execution of their loop iterations is critical for achieving high performance. However, several factors may lead to an imbalanced load execution, such as problem characteristics, algorithmic, and systemic variations. Dynamic loop self-scheduling (DLS) techniques are devised to mitigate these factors, and consequently, improve application performance. On distributed-memory systems, DLS techniques can be implemented using a hierarchical master-worker execution model and are, therefore, called hierarchical DLS techniques. These techniques self-schedule loop iterations at two levels of hardware parallelism: across and within compute nodes. Hybrid programming approaches that combine the message passing interface (MPI) with open multi-processing (OpenMP) dominate the implementation of hierarchical DLS techniques. The MPI-3 standard includes the feature of sharing memory regions among MPI processes. This feature introduced the MPI+MPI approach that simplifies the implementation of parallel scientific applications. The present work designs and implements hierarchical DLS techniques by exploiting the MPI+MPI approach. Four well-known DLS techniques are considered in the evaluation proposed herein. The results indicate certain performance advantages of the proposed approach compared to the hybrid MPI+OpenMP approach.
