Over the last few decades, high-performance computing (HPC) systems have undergone a significant increase in their processing capabilities. Modern HPC systems combine very large numbers of homogeneous and heterogeneous computing resources. Scalability is, therefore, essential for scientific applications to efficiently exploit the massive parallelism and computing power of modern HPC systems. This work introduces a scalable version of the parallel spin-image algorithm (PSIA), called APSIA. The PSIA is a parallel version of the well-known spin-image algorithm (SIA). The (P)SIA is used in various domains, such as 3D object recognition, categorization, and 3D face recognition. To the best of our knowledge, the scalability of the PSIA has not yet been studied. APSIA refers to the extended version of the PSIA that integrates various well-known dynamic loop scheduling (DLS) techniques. Through loop scheduling and dynamic load balancing, this integration enables an improved and scalable execution of the PSIA on homogeneous and heterogeneous HPC systems. The present work: (1) proposes APSIA, a novel flexible and scalable version of the PSIA; (2) showcases the benefits of applying DLS techniques for optimizing the performance of the PSIA; (3) assesses the performance of the proposed APSIA through several scalability experiments on more than 300 heterogeneous computing cores. The performance results are promising and show that, using well-known DLS techniques, the APSIA outperforms the PSIA by factors of 1.2 and 2 on homogeneous and heterogeneous computing resources, respectively.
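The spin-image kernel at the core of the (P)SIA maps every mesh vertex into a 2D histogram anchored at an oriented point. The following minimal Python sketch follows the standard spin-image formulation; the function name, bin count, and support size are illustrative choices, not parameters taken from the paper:

```python
import math

def spin_image(vertices, p, n, bins=8, size=1.0):
    """Accumulate a spin image for the oriented point (p, n).

    For each vertex x, the cylindrical coordinates are
      alpha = sqrt(|x - p|^2 - (n . (x - p))^2)   (radial distance from the normal axis)
      beta  = n . (x - p)                          (signed elevation along the normal)
    which index a 2D histogram -- the spin image.
    """
    img = [[0] * bins for _ in range(bins)]
    for x in vertices:
        d = [x[i] - p[i] for i in range(3)]
        beta = sum(n[i] * d[i] for i in range(3))
        dist2 = sum(c * c for c in d)
        alpha = math.sqrt(max(dist2 - beta * beta, 0.0))
        i = int((beta + size) / (2 * size) * bins)  # shift beta into [0, 2*size)
        j = int(alpha / size * bins)
        if 0 <= i < bins and 0 <= j < bins:
            img[i][j] += 1
    return img
```

Because each oriented point yields an independent histogram, the outer loop over oriented points is the computationally intensive parallel loop that the DLS techniques schedule.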
Load balancing of computational tasks over heterogeneous architectures is an area of paramount importance due to the growing heterogeneity of HPC platforms and the higher performance and energy efficiency they could offer. This paper aims to address this challenge for a heterogeneous platform comprising Intel Xeon multi-core processors and Intel Xeon Phi accelerators (MIC) using an empirical approach. The proposed approach is investigated through a case study of the spin-image algorithm, selected due to its computationally intensive nature and a wide range of applications, including 3D database retrieval systems and object recognition. The contributions of this paper are threefold. First, we introduce a parallel spin-image algorithm (PSIA) that achieves a speedup of 19.8 on 24 CPU cores. Second, we provide results for a hybrid implementation of the PSIA for a heterogeneous platform comprising CPU and MIC; to the best of our knowledge, this is the first such heterogeneous implementation of the spin-image algorithm. Third, we use a range of 3D objects to empirically find a strategy to load-balance computations between the MIC and CPU cores, achieving speedups of up to 32.4 over the sequential version. The LIRIS 3D mesh watermarking dataset is used for performance analysis and optimization.
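One common way to realize such an empirical CPU/MIC split is to partition the iteration space in proportion to throughputs measured in calibration runs, so both devices finish at roughly the same time. The sketch below illustrates this idea only; the function name and the rate values in the test are hypothetical, not the strategy reported in the paper:

```python
def split_work(total_iters, cpu_rate, mic_rate):
    """Split loop iterations between CPU and MIC in proportion to their
    empirically measured throughputs (iterations per second), so that
    both devices ideally complete their shares simultaneously."""
    frac = cpu_rate / (cpu_rate + mic_rate)   # CPU's share of total throughput
    cpu_share = round(total_iters * frac)
    return cpu_share, total_iters - cpu_share
```

For example, a CPU measured at three times the MIC's throughput would receive three quarters of the iterations.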
Scientific applications are often irregular and characterized by large computationally-intensive parallel loops. Dynamic loop scheduling (DLS) techniques can improve the performance of such applications via load balancing of their execution on high-performance computing (HPC) systems. Identifying the most suitable choices of data distribution strategies, system sizes, and DLS techniques for a given application requires intensive assessment and a large number of exploratory native experiments (using real applications on real systems), which may not always be feasible or practical due to the associated time and costs. In such cases, simulative experiments, which are faster and less costly, are more appropriate for studying and optimizing application performance. This motivates the question: 'How realistic are the simulations of executions of scientific applications using DLS on HPC platforms?' In the present work, a methodology is devised to answer this question. It involves the experimental verification and analysis of the performance of DLS in scientific applications. The proposed methodology is employed for a computer vision application executed with four DLS techniques on two different HPC platforms, both via native and simulative experiments. The evaluation and analysis of the native and simulative results indicate that the accuracy of the simulative experiments is strongly influenced by the approach chosen to extract the computational effort of the application (FLOP- or time-based), the representation of the application in the simulation (data-parallel or task-parallel), and the choice of HPC subsystem models available in the simulator (multi-core CPUs, memory hierarchy, and network topology).
Further insights are also presented and discussed: the native performance of the two HPC platforms compared with each other, the simulated performance using the two SimGrid interfaces compared with each other, and the native versus the simulated performance for each of the simulated HPC platforms. The minimum and maximum percent errors between the native and simulative experiments are 0.95% and 8.03%, respectively.
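The percent error reported here is, under a common convention, the deviation of the simulated execution time from the native one relative to the native time. A minimal sketch of this metric (the exact definition used in the paper is not stated, so this is an assumption, and the timing values in the test are illustrative only):

```python
def percent_error(native_time, simulated_time):
    """Relative deviation of the simulated execution time from the
    native execution time, expressed as a percentage."""
    return abs(native_time - simulated_time) / native_time * 100.0
```
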
Modern computing architectures exhibit increasing parallelism. Dynamic loop scheduling (DLS) therefore plays a growing role in the performance optimization of parallel applications executing on these architectures. Over the previous decades, a large body of research has addressed DLS techniques. Reproducing earlier DLS experiments is important for ensuring the trustworthiness of DLS implementations in modern scheduling tools or within new scientific applications: the results of executing the implemented DLS techniques are expected to agree with the results reported in earlier work. The present work is a step towards reproducing the experiments that introduced the well-known DLS technique named factoring (FAC). Studying scheduling techniques via simulation, rather than native execution, is favorable for maintaining control over all the factors that may affect performance. The use of simulation in this work is essential for reproducing scheduling experiments performed on computing systems that no longer exist. This work shows that the self-scheduling technique with a matrix multiplication kernel performs significantly worse on the modern system considered in this study than on the past system.
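Factoring assigns loop iterations in batches of decreasing chunk size: in the practical variant of FAC, each batch hands every worker half of its equal share of the iterations still remaining. A minimal sketch of this chunk-size rule, assuming the practical FAC formulation with the batching factor fixed at 2:

```python
import math

def factoring_chunks(n_iters, n_workers):
    """Chunk sizes produced by practical factoring (FAC): each batch
    issues n_workers chunks, each equal to half the per-worker share
    of the remaining iterations, so chunk sizes decrease over time."""
    chunks, remaining = [], n_iters
    while remaining > 0:
        chunk = math.ceil(remaining / (2 * n_workers))
        for _ in range(n_workers):
            if remaining == 0:
                break
            c = min(chunk, remaining)  # the last chunk may be smaller
            chunks.append(c)
            remaining -= c
    return chunks
```

Large early chunks keep scheduling overhead low, while small late chunks smooth out load imbalance near the end of the loop; pure self-scheduling, by contrast, assigns one iteration at a time and pays the scheduling overhead on every iteration.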
Scientific applications often contain large computationally-intensive parallel loops. Loop scheduling techniques aim to achieve load-balanced executions of such applications. For distributed-memory systems, existing dynamic loop scheduling (DLS) libraries are typically MPI-based and employ a master-worker execution model to assign variably-sized chunks of loop iterations. The master-worker execution model may adversely impact performance due to contention at the master. This work proposes a distributed chunk-calculation approach that does not require the master-worker execution scheme. Moreover, it exploits novel features of the latest MPI standards, such as passive-target remote memory access, shared-memory window creation, and atomic read-modify-write operations. To evaluate the proposed approach, five well-known DLS techniques, two applications, and two heterogeneous hardware setups are considered. The DLS techniques implemented using the proposed approach outperformed their counterparts implemented using the traditional master-worker execution model.
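The distributed chunk-calculation idea can be illustrated with a shared loop counter that every worker advances via an atomic read-modify-write and then uses to compute its own chunk locally, with no master process in the loop. In this sketch a threading.Lock stands in for MPI's passive-target MPI_Fetch_and_op on an RMA window, and guided self-scheduling (GSS) is used as the example chunk rule; function and variable names are illustrative:

```python
import math
import threading

def run_distributed_gss(n_iters, n_workers):
    """Masterless chunk calculation: each worker atomically advances a
    shared counter (a stand-in for an MPI RMA window updated with
    MPI_Fetch_and_op) and derives its GSS chunk size locally."""
    lock = threading.Lock()
    state = {"next": 0}                      # shared loop counter
    assigned = [[] for _ in range(n_workers)]

    def worker(rank):
        while True:
            with lock:                       # atomic fetch-and-add
                start = state["next"]
                remaining = n_iters - start
                if remaining <= 0:
                    return
                chunk = max(1, math.ceil(remaining / n_workers))  # GSS rule
                state["next"] = start + chunk
            # execute iterations [start, start + chunk) outside the critical section
            assigned[rank].append((start, chunk))

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return assigned
```

Because each worker only needs the counter value to compute its chunk, no process is dedicated to handing out work, which removes the contention point that the master-worker model introduces.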