“…State-of-the-art techniques that combine distributed- and shared-memory programming models [80], as well as many PGAS approaches [6,24,47,48], have demonstrated the potential benefits of combining both levels of parallelism [81,82,39,83], including increased communication-computation overlap [84,85], improved memory utilization [86,87], power optimization [88], and effective use of accelerators [89,90,91,92]. A hybrid MPI-and-threads model, such as MPI combined with OpenMP, can take advantage of those optimized shared-memory algorithms and data structures.…”