2020 IEEE International Conference on Cluster Computing (CLUSTER)
DOI: 10.1109/cluster49012.2020.00042

Towards Data-Flow Parallelization for Adaptive Mesh Refinement Applications

Abstract: Adaptive Mesh Refinement (AMR) is a prevalent method used by distributed-memory simulation applications to adapt the accuracy of their solutions depending on the turbulent conditions in each of their domain regions. These applications are usually dynamic, since their domain areas are refined or coarsened in various refinement stages during their execution. Thus, they periodically redistribute their workloads among processes to avoid load imbalance. Although the de facto standard for scientific computing in distr…

Cited by 5 publications (8 citation statements)
References 16 publications (35 reference statements)
“…The hybrid variants send/receive/write each boundary block face from a different task (i.e., in separate messages). That is not the optimal configuration (the optimum is around eight faces per message), but it provides very reasonable performance [19] and puts more pressure on the communication phases. We show the throughput speedup of the strong scaling on the upper part of Figure 11 and the parallel efficiency on the lower part.…”
Section: B. MiniAMR
Confidence: 98%
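To make the trade-off concrete, the following is a minimal, hypothetical C/MPI sketch of the two messaging strategies the statement contrasts: one message per boundary face versus packing several faces per message. All names (Face, FACE_DOUBLES, FACES_PER_MSG) and sizes are assumptions for illustration, not the benchmark's actual code.

#include <mpi.h>
#include <string.h>

#define FACE_DOUBLES 256   /* assumed per-face payload size */
#define FACES_PER_MSG 8    /* the "around eight faces" sweet spot cited above */

typedef struct { double data[FACE_DOUBLES]; int neighbor; } Face;

/* One message per face: simple and more concurrent, but it floods the
   network with small messages, stressing the communication phases. */
static void exchange_per_face(const Face *faces, int nfaces, MPI_Comm comm) {
    for (int i = 0; i < nfaces; ++i)
        MPI_Send(faces[i].data, FACE_DOUBLES, MPI_DOUBLE,
                 faces[i].neighbor, /*tag=*/i, comm);
}

/* Aggregated variant: pack up to FACES_PER_MSG faces bound for the same
   neighbor into one buffer before sending, cutting the message count. */
static void exchange_aggregated(const Face *faces, int nfaces, int neighbor,
                                MPI_Comm comm) {
    double buf[FACES_PER_MSG * FACE_DOUBLES];
    int packed = 0, tag = 0;
    for (int i = 0; i < nfaces; ++i) {
        if (faces[i].neighbor != neighbor)
            continue;
        memcpy(buf + (size_t)packed * FACE_DOUBLES, faces[i].data,
               FACE_DOUBLES * sizeof(double));
        if (++packed == FACES_PER_MSG) {
            MPI_Send(buf, packed * FACE_DOUBLES, MPI_DOUBLE, neighbor,
                     tag++, comm);
            packed = 0;
        }
    }
    if (packed > 0)   /* flush the last, partially filled message */
        MPI_Send(buf, packed * FACE_DOUBLES, MPI_DOUBLE, neighbor, tag, comm);
}

The per-face variant maximizes the number of independent tasks (and hence overlap opportunities), which is why the statement accepts its sub-optimal message size in exchange for deliberately stressing the communication phases.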
“…MiniAMR interleaves multiple phases of computation and communication, followed periodically by a refinement and load-balancing phase. Previous works [24], [19] fully taskified its computation and communication phases, as well as parts of the refinement and load-balancing [19], using OmpSs-2 and TAMPI. We take that taskification as the base and port the communication phases to TAGASPI.…”
Section: B. MiniAMR
Confidence: 99%
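The taskification referenced here can be pictured with a minimal, hypothetical OmpSs-2 + TAMPI sketch: communication is issued from tasks annotated with data dependencies, so the runtime can overlap per-block communication with computation. Block count, sizes, tags, and neighbor ranks are illustrative assumptions rather than miniAMR's actual code, and startup via MPI_Init_thread requesting TAMPI's MPI_TASK_MULTIPLE level is omitted.

#include <mpi.h>

#define NBLOCKS 64
#define BSIZE   4096

static double blocks[NBLOCKS][BSIZE];

void exchange_and_compute(int left, int right, MPI_Comm comm) {
    for (int b = 0; b < NBLOCKS; ++b) {
        /* Receive task: with TAMPI in blocking mode, MPI_Recv pauses
           only this task, not the worker thread, so other ready tasks
           keep executing underneath. */
        #pragma oss task out(blocks[b][0;BSIZE]) label("recv")
        MPI_Recv(blocks[b], BSIZE, MPI_DOUBLE, left, b, comm,
                 MPI_STATUS_IGNORE);

        /* Compute task: starts as soon as its block has arrived,
           giving per-block communication/computation overlap. */
        #pragma oss task inout(blocks[b][0;BSIZE]) label("compute")
        for (int i = 0; i < BSIZE; ++i)
            blocks[b][i] *= 0.5;

        /* Send task: forwards the updated block once it is computed. */
        #pragma oss task in(blocks[b][0;BSIZE]) label("send")
        MPI_Send(blocks[b], BSIZE, MPI_DOUBLE, right, b, comm);
    }
    #pragma oss taskwait
}

A port to TAGASPI, as the statement describes, would swap the two-sided MPI calls inside the tasks for one-sided GASPI transfers and notifications while keeping the same task and dependency structure.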
“…The set of experiments we used includes well-known benchmarks such as Cholesky, Dotproduct, MultiSAXPY, STREAM, NBody, NQueens, and the Gauss-Seidel solver for the Heat equation. The benchmarks in this list can be categorized as (i) purely memory-bound, such as Heat, Dotproduct, MultiSAXPY, and STREAM; (ii) purely compute-bound, such as NBody and NQueens; and (iii) balanced, such as Cholesky, miniAMR, HPCCG, and LULESH. Furthermore, all of these benchmarks have been parallelized using tasks, as their task-based versions [15], [16], [17] offer competitive or better performance than their fork-join OpenMP counterparts. Finally, the evaluation is partitioned into two phases.…”
Section: A. Experimental Setup
Confidence: 99%
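As a rough illustration of the fork-join versus task-based distinction drawn here, the following C sketch shows a dot product in both styles. Plain OpenMP is used so the example is self-contained; the cited task-based benchmark versions [15], [16], [17] actually target OmpSs-2, and the block size is an assumption.

#include <omp.h>

#define N  (1L << 24)
#define BS (1L << 16)   /* assumed block size */

/* Fork-join style: a single parallel loop with a reduction; all
   threads synchronize at the implicit barrier at the end of the loop. */
double dot_forkjoin(const double *x, const double *y) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < N; ++i)
        sum += x[i] * y[i];
    return sum;
}

/* Task-based style: one task per block with a task reduction
   (OpenMP 5.0); blocks execute as workers become free, avoiding a
   global barrier per loop. */
double dot_tasks(const double *x, const double *y) {
    double sum = 0.0;
    #pragma omp parallel
    #pragma omp single
    #pragma omp taskloop grainsize(1) reduction(+:sum)
    for (long b = 0; b < N / BS; ++b)
        for (long i = b * BS; i < (b + 1) * BS; ++i)
            sum += x[i] * y[i];
    return sum;
}

For a memory-bound kernel like this one, both styles are ultimately bandwidth-limited; the task-based form mainly pays off when such kernels are composed with other phases, where data dependencies let the runtime interleave them instead of separating them with barriers.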