2019
DOI: 10.1016/j.parco.2019.102582

DuctTeip: An efficient programming model for distributed task-based parallel computing

Abstract: Current high-performance computer systems used for scientific computing typically combine shared memory computational nodes in a distributed memory environment. Extracting high performance from these complex systems requires tailored approaches. Task-based parallel programming has been successful both in simplifying the programming and in exploiting the available hardware parallelism for shared memory systems. In this paper we focus on how to extend task parallel programming to distributed memory systems. We u…
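As an illustration of the shared-memory task model the abstract refers to, the following minimal C++ sketch splits a reduction into independent tasks that a runtime maps onto the available cores. It uses plain std::async and is not taken from DuctTeip itself; function and parameter names are illustrative only.

// Minimal task-parallel reduction sketch (illustrative, not DuctTeip code).
// Each task reduces one chunk; the tasks are independent, so the runtime may
// run them concurrently on different cores.
#include <algorithm>
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

double parallel_sum(const std::vector<double>& v, std::size_t ntasks = 4) {
    std::vector<std::future<double>> tasks;
    const std::size_t chunk = (v.size() + ntasks - 1) / ntasks;
    for (std::size_t begin = 0; begin < v.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, v.size());
        tasks.push_back(std::async(std::launch::async, [&v, begin, end] {
            return std::accumulate(v.begin() + begin, v.begin() + end, 0.0);
        }));
    }
    double total = 0.0;
    for (auto& t : tasks) total += t.get();   // join: collect partial results
    return total;
}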

Cited by 18 publications (16 citation statements); references 37 publications. Citing publications span 2019-2023.
“…We found Dask to provide a good balance between simplicity, portability and performance, and chose to use it to implement the orchestrator that ships with Orchestral. There are, however, many alternatives available [4], [29], [5], [34], [23], [33], [18], which could potentially help us to push performance for low-latency requirements on distributed infrastructures. In fact, Orchestral could be used to create a comparative benchmark of all these libraries.…”
Section: Discussion
confidence: 99%
“…Given these limitations on the widely used structured SWAN grid approach, SWAN grids will almost exclusively be deemed a model with low spatial computational demand. Small tasks cause a sharp drop in performance with the Intel C++ compiler due to the "work stealing" algorithm, which aims to balance the computational load between threads (Zafari et al., 2019). In this scenario, the threads compete against each other, resulting in an unproductive simulation.…”
Section: Methodology and Background
confidence: 99%
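The task-granularity point above can be made concrete with a small sketch (ours, not from the cited SWAN work): with one task per element, per-task scheduling and stealing overhead dominates, whereas chunking the same work into blocks amortizes it. The block size B is an illustrative choice.

// Illustrative OpenMP sketch of task granularity (compile with -fopenmp).
#include <algorithm>
#include <cstddef>
#include <vector>

// One task per element: the work per task is tiny, so scheduling/stealing
// overhead dominates and threads mostly compete for tasks.
void fine_grained(std::vector<double>& x) {
    #pragma omp parallel
    #pragma omp single
    for (std::size_t i = 0; i < x.size(); ++i) {
        #pragma omp task firstprivate(i) shared(x)
        x[i] = 2.0 * x[i] + 1.0;
    }
}

// One task per block of B elements: the same total work, but far fewer tasks,
// so the per-task overhead is amortized over B updates.
void chunked(std::vector<double>& x, std::size_t B = 4096) {
    #pragma omp parallel
    #pragma omp single
    for (std::size_t i = 0; i < x.size(); i += B) {
        #pragma omp task firstprivate(i) shared(x)
        for (std::size_t j = i; j < std::min(i + B, x.size()); ++j)
            x[j] = 2.0 * x[j] + 1.0;
    }
}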
“…All of the operations in the NESA algorithm are dense matrix-vector products, with the same computational intensity of 2 flop/double. For modern multicore architectures, a computational intensity of 30-40 flop/double is needed in order to balance bandwidth capacity and floating-point performance; see for example the trade-offs for the Tintin and Rackham systems at UPPMAX, Uppsala University, calculated in [64]. This means that we need to exploit data locality (work on data that is cached locally) in order to overcome bandwidth limitations and scale to the full number of available cores.…”
Section: Specific Properties of the NESA Algorithm
confidence: 99%
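The 2 flop/double figure follows from a short back-of-the-envelope calculation; this is our reconstruction of the standard argument, not a quotation from [64]:

% Computational intensity of a dense matrix-vector product (sketch).
\[
  y = A x, \quad A \in \mathbb{R}^{n \times n}: \qquad
  \text{flops} = 2n^2 \ (\text{one multiply and one add per } a_{ij}), \qquad
  \text{data} = n^2 \ \text{doubles (each } a_{ij} \text{ read once)},
\]
\[
  I = \frac{2n^2\ \text{flop}}{n^2\ \text{doubles}} = 2\ \text{flop/double}
  \;\ll\; 30\text{--}40\ \text{flop/double (machine balance quoted above)},
\]
so the operation is strongly bandwidth bound rather than compute bound.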
“…The ongoing trend in cluster hardware is an increasing number of cores per computational node. When scaling to large numbers of cores, it is hard to fully exploit the computational resources using a pure MPI implementation, due to the rapid increase in the number of inter-node messages with the number of MPI processes for communication-heavy algorithms [64]. As is pointed out in [35], a hybrid parallelization with MPI at the distributed level and threads within the computational nodes is more likely to perform well.…”
Section: State of the Art
confidence: 99%
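A minimal sketch of the hybrid scheme described above (MPI between nodes, threads within a node). This is generic MPI + OpenMP and does not reproduce DuctTeip's actual API.

// Hybrid MPI + OpenMP skeleton (illustrative only).
#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char** argv) {
    int provided;
    // Request a thread level that allows the process to be multithreaded
    // while only the master thread makes MPI calls (typical for hybrid codes).
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // One MPI process per node keeps the number of inter-node messages low;
    // the cores of the node are used via threads instead of extra ranks.
    #pragma omp parallel
    {
        #pragma omp single
        std::printf("rank %d/%d uses %d threads\n",
                    rank, size, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}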