We show how to extend classical work-stealing to handle data-parallel tasks that may require any number of threads r ≥ 1 for their execution. As threads become idle, they attempt to join a team of threads designated for a task requiring r > 1 threads. Team building follows a deterministic pattern involving log p possibly randomized steal attempts, where p is the number of started hardware threads. Deterministic work-stealing often exhibits good locality properties that are desirable to preserve. Threads attempting to join the team for a task requiring a large team may help smaller teams instead of waiting for the large team to form. We explain in detail this idea of work-stealing with deterministic team-building, which naturally generalizes classical work-stealing. The implementation uses standard lock-free data structures and, beyond these, requires only a single extra compare-and-swap (CAS) operation per thread as a team is being built. Once formed, a team can stay together to process further tasks requiring the same (or a smaller) number of threads; this requires no further coordination. In the degenerate case where all tasks require only a single thread, the implementation coincides with a (deterministic) work-stealing implementation, incurs no extra overhead, and therefore has similar theoretical properties. We establish correctness of the generalized work-stealing algorithm by arguing for deadlock freedom and completeness (all tasks are eventually executed, regardless of their resource requirement r ≤ p), discuss its load-balancing, task execution order, and memory-consumption properties, and consider a number of algorithmic and implementation variations. A prototype C++ implementation of the generalized work-stealing algorithm is briefly described.
Building on this, we have implemented a serious, well-known contender for a best parallel Quicksort algorithm, which naturally relies on both task and data parallelism. On an 8-core Intel Nehalem system, a 16-core AMD Opteron system, a 16-core Sun T2+ system supporting up to 128 hardware threads, and a 32-core Intel Nehalem EX system, we compare our implementation of the published Quicksort algorithm using fork-join parallelism to a mixed-mode parallel implementation with a data-parallel partitioning step using our deterministic team-building work-stealer. Results are consistently better, often by a significant fraction. For instance, sorting 2^27 − 1 randomly generated integers, we could improve the speed-up from 5.1 to 8.7 on the large 32-core Intel system, on which our implementation is consistently better than the tuned, task-parallel Cilk++ system.

* The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013).