Current high-performance computer systems used for scientific computing typically combine shared-memory computational nodes in a distributed-memory environment. Extracting high performance from these complex systems requires tailored approaches. Task-based parallel programming has been successful both in simplifying programming and in exploiting the available hardware parallelism on shared-memory systems. In this paper we focus on how to extend task-parallel programming to distributed-memory systems. We use a hierarchical decomposition of tasks and data in order to accommodate the different levels of the hardware. We test the proposed programming model on two different applications: a Cholesky factorization and a solver for the Shallow Water Equations. We also compare the performance of our implementation with that of other frameworks for distributed task-parallel programming, and show that it is competitive.

arXiv:1801.03578v1 [cs.DC] 10 Jan 2018

…to the computational work performed by one node, the 1×9 process grid has the smallest variance between nodes, and therefore also the lowest maximum work size. The 9×1 process grid leads to a smaller maximum work size than the 3×3 process grid if B is large enough, but suffers from significant load imbalance in the case B = 18. In all cases, the work becomes more evenly distributed as the number of level 1 tasks B increases. The statistics for communication and computation point in different directions, but when comparing with actual run times, we have found that the communication size is the most informative measure. A large total communication size is likely to be detrimental to performance, as the risk of tasks being left waiting for remote data increases, as does the risk of message congestion. Of the factors considered, using a square process grid has the largest impact.
Regarding the block sizes, a large B improves the load balance, but increases both the total amount of communication and the number of messages (another indicator, not shown in the graphics).
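To make the load-balance comparison concrete, the following sketch (not the paper's actual code) distributes the lower triangle of a B×B block grid block-cyclically over a Pr×Pc process grid and reports per-process block counts. This is only a rough proxy for the work statistics discussed above: a faithful comparison would weight each block by its actual task cost, and the paper's hierarchical decomposition may assign work differently.

```python
# Hypothetical illustration: count lower-triangular blocks owned by each
# process under a 2D block-cyclic mapping, for the process grids and the
# block count B = 18 mentioned in the text.
from collections import Counter

def block_counts(B, Pr, Pc):
    """Number of blocks owned by each process in a Pr x Pc grid."""
    counts = Counter()
    for i in range(B):            # block row
        for j in range(i + 1):    # lower triangle: j <= i
            counts[(i % Pr, j % Pc)] += 1
    return counts

def summary(B, Pr, Pc):
    """(min, max, total) blocks per process -- a crude imbalance measure."""
    vals = list(block_counts(B, Pr, Pc).values())
    return min(vals), max(vals), sum(vals)

for Pr, Pc in [(1, 9), (3, 3), (9, 1)]:
    lo, hi, total = summary(18, Pr, Pc)
    print(f"{Pr}x{Pc} grid: min={lo} max={hi} total={total}")
```

With unit block weights the 1×9 and 9×1 grids give mirror-image distributions, so the differences reported in the text must come from the actual per-task costs; the sketch is meant only to show how the grid shape changes which node owns which blocks.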