Using HPX and OP2 for Improving Parallel Scaling Performance of Unstructured Grid Applications

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

Kaiser

Ramanujam

2017

Self Cite

Maximizing the level of parallelism in an application can be achieved by minimizing the overheads caused by load imbalance and the waiting time caused by memory latency. Compiler optimization is one of the most effective ways to tackle this problem: the compiler can detect the data dependencies in an application and analyze specific sections of code for parallelization potential. However, these techniques are usually applied at compile time and therefore rely on static analysis, which is insufficient for achieving maximum parallelism and the desired application scalability. One way to address this challenge is to use runtime methods, which delay a certain amount of code analysis until runtime. In this research, we improve the performance of parallel applications generated by the OP2 compiler by leveraging HPX, a C++ runtime system, to provide runtime optimizations. These optimizations include asynchronous tasking, loop interleaving, dynamic chunk sizing, and data prefetching. The results were evaluated using an Airfoil application, which showed a 40-50% improvement in parallel performance.

Section: B Airfoil Application (mentioning)

confidence: 99%

Section: HPX (mentioning)

confidence: 99%

Redesigning OP2 Compiler to Use HPX Runtime Asynchronous Techniques

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

Kaiser

Ramanujam

2017

Self Cite

“…Chunk size is the amount of work performed by each task [12,13]; it is either determined by an auto_partitioner exposed by the HPX algorithms or passed by using static/dynamic_chunk_size as an execution policy's parameter [10]. However, (1) the experimental results in [4] and [3] showed that the overheads of determining the chunk size by using the auto_partitioner negatively affected the application's scalability in some cases; (2) the policy written by the user will often not be able to determine the optimum chunk size either, due to the limited runtime information. • In [14], we proposed the HPX prefetching method, which aids prefetching that not only reduces memory access latency but also relaxes the global barrier.…”

Section: Introduction (mentioning)

confidence: 99%
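The notion of chunk size in the statement above can be illustrated with a minimal sketch. It is not HPX code and does not use the HPX API: it uses Python's standard `concurrent.futures` thread pool, and the names `make_chunks` and `chunked_for_each` are hypothetical helpers introduced only for this illustration. It assumes a fixed (static) chunk size chosen by the caller, so each task processes one contiguous chunk of the loop range.

```python
from concurrent.futures import ThreadPoolExecutor


def make_chunks(n, chunk_size):
    """Return half-open [start, stop) index ranges that cover range(n)."""
    return [(s, min(s + chunk_size, n)) for s in range(0, n, chunk_size)]


def chunked_for_each(data, func, chunk_size, workers=4):
    """Apply func to every element, scheduling one task per chunk.

    chunk_size is the amount of work given to each task (an assumption of
    this sketch: it is fixed up front rather than tuned at runtime).
    """
    def run_chunk(bounds):
        start, stop = bounds
        return [func(x) for x in data[start:stop]]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves chunk order, so the results stay in input order.
        parts = pool.map(run_chunk, make_chunks(len(data), chunk_size))
        return [y for part in parts for y in part]
```

A small chunk size creates many short tasks (more scheduling overhead, better load balance); a large one creates few long tasks (less overhead, coarser balance). This is the trade-off that the quoted passage says is hard to settle without runtime information.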

“…While runtime adaptive methods have been shown to be very effective, especially for highly dynamic scenarios, solely relying on them doesn't guarantee maximal parallel performance, since the performance of an application depends on both the values measured at runtime and the related transformations performed at compile time. Collecting the outcome of the static analysis performed by the compiler could significantly improve runtime decisions and therefore application performance [1][2][3][4].…”

Section: Introduction (mentioning)

confidence: 99%

HPX Smart Executors

Proceedings of the Third International Workshop on Extreme Scale Programming Models and Middleware

Troska

Kaiser

et al. 2017

Self Cite

The performance of many parallel applications depends on loop-level parallelism. However, manually parallelizing all loops may degrade parallel performance, as some of them cannot scale desirably to a large number of threads. In addition, the overheads of manually tuning loop parameters might prevent an application from reaching its maximum parallel performance. We illustrate how machine learning techniques can be applied to address these challenges. In this research, we develop a framework that is able to automatically capture the static and dynamic information of a loop. Moreover, we advocate a novel method that introduces HPX smart executors for determining the execution policy, chunk size, and prefetching distance of an HPX loop; higher performance is achieved by feeding static information captured during compilation and runtime-based dynamic information to our learning model. Our evaluated execution results show that using these smart executors can speed up the HPX execution process by around 12%-35% for the Matrix Multiplication, Stream, and 2D Stencil benchmarks compared to setting the HPX loop's execution policy/parameters manually or using HPX auto-parallelization techniques.

A Load-Balanced Parallel and Distributed Sorting Algorithm Implemented with PGX.D

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

Lee

Depner

et al. 2017

Self Cite

Sorting is one of the most studied and most challenging problems across scientific fields. Although many techniques and algorithms have been proposed for efficient parallel sorting, achieving the desired performance on different types of architectures with a large number of processors is still a challenging issue. Maximizing the level of parallelism in an application can be achieved by minimizing the overheads due to load imbalance and the waiting time due to memory latencies. In this paper, we present a distributed sorting algorithm implemented in PGX.D, a fast distributed graph processing system, which outperforms Spark's distributed sorting implementation by around 2x-3x by hiding communication latencies and minimizing unnecessary overheads. Furthermore, we show that the proposed PGX.D sorting method efficiently handles datasets containing many duplicated data entries and always keeps workloads balanced for different input data distribution types.
Index Terms: Distributed sorting method, PGX.D distributed graph framework, Graph.
* This work was done during the author's internship at Oracle Labs.
In this research, we propose a new distributed sorting method, which overcomes these challenges by keeping the load balanced and minimizes overheads by fetching data efficiently in the partitioning and merging steps. A new handler is proposed that balances the merging while the merging steps are parallelized, which improves parallel performance. Moreover, a new investigator is proposed that keeps workloads balanced among the distributed processors when dealing with datasets containing many duplicated data entries. This method is implemented in PGX.D, which is a scalable framework for various distributed implementations.
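The text above does not describe how the handler or the investigator work internally, so the following is only a generic sketch of one common way to keep partitions balanced when a dataset contains many duplicates; it should not be read as the PGX.D method. The function name `balanced_partition` is hypothetical, and the splitters are taken as exact quantiles of the full input for simplicity, whereas a distributed implementation would typically estimate them from samples. The idea is to tag each key with its position, so that equal keys become distinct ordered pairs and can be spread over several buckets.

```python
from bisect import bisect_right


def balanced_partition(keys, num_parts):
    """Split keys into num_parts buckets of near-equal size.

    Equal keys are disambiguated by their original index, so a run of
    duplicates does not pile up in a single bucket.
    """
    # Tag every key with its position: (key, index) pairs are all distinct.
    tagged = [(k, i) for i, k in enumerate(keys)]

    # Assumption of this sketch: exact quantiles of the whole input are
    # used as splitters (a distributed sort would sample instead).
    ordered = sorted(tagged)
    n = len(ordered)
    splitters = [ordered[(p * n) // num_parts] for p in range(1, num_parts)]

    buckets = [[] for _ in range(num_parts)]
    for item in tagged:
        # Pairs equal to a splitter go to the bucket on its right.
        buckets[bisect_right(splitters, item)].append(item[0])
    return buckets
```

Because every bucket holds only keys that are no larger than those in the next bucket, sorting each bucket independently and concatenating the results yields a fully sorted sequence.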
PGX.D [7], [8] is a fast, parallel, and distributed graph analytic framework that is able to process large graphs in distributed environments while keeping workloads well balanced among distributed machines. It improves the performance of the proposed sorting technique by exposing a programming model that intrinsically reduces poor utilization of resources by maintaining balanced workloads, minimizes latencies by managing parallel tasks efficiently, and provides asynchronous task execution for sending/receiving data to/from remote processors. The results presented in [7] show that PGX.D has low overhead and a bandwidth-efficient communication framework, which easily supports remote data pulling patterns and is about 3x-90x faster than other distributed graph systems such as GraphLab. Moreover, PGX.D decreases communication overheads by delaying unnecessary computations until the end of the current step, which allows the other processes to continue without waiting for the completion of all previous computations. It also allows asynchronous local and remote requests, which avoids unnecessary synchronization barriers and helps to increase the scalability of the distributed sorting method [9]. In this paper, we s...