Locality-Aware Task Scheduling and Data Distribution for OpenMP Programs on NUMA Systems and Manycore Processors

Muddukrishna, Ananya; Jönsson, Peter; Brorsson, Mats

doi:10.1155/2015/981759

Cited by 17 publications

(18 citation statements)

References 31 publications


“…Scheduling to improve data locality and minimizing NUMA effects in shared memory task parallel execution is an active research area [6,21,22,23,24,25,26,27] and can also be coupled to energy considerations [28,29]. Left: A small task graph where all accesses (the type is indicated for each task) are assumed to be to the same shared data.…”

Section: Tracking Dependencies Through Data Versioning (mentioning)

confidence: 99%

DuctTeip: An efficient programming model for distributed task-based parallel computing

2019


Current high-performance computer systems used for scientific computing typically combine shared memory computational nodes in a distributed memory environment. Extracting high performance from these complex systems requires tailored approaches. Task based parallel programming has been successful both in simplifying the programming and in exploiting the available hardware parallelism for shared memory systems. In this paper we focus on how to extend task parallel programming to distributed memory systems. We use a hierarchical decomposition of tasks and data in order to accommodate the different levels of hardware. We test the proposed programming model on two different applications, a Cholesky factorization, and a solver for the Shallow Water Equations. We also compare the performance of our implementation with that of other frameworks for distributed task parallel programming, and show that it is competitive.

arXiv:1801.03578v1 [cs.DC] 10 Jan 2018

… to the computational work performed by one node, the 1 × 9 process grid has the smallest variance between nodes, and therefore also the lowest maximum work size. The 9 × 1 process grid leads to smaller maximum work size than the 3 × 3 process grid if B is large enough, but suffers from significant load imbalance in the case B = 18. In all cases, the work becomes more evenly distributed if the number of level 1 tasks B is larger. The statistics for communication and computation point in different directions, but when comparing with actual run times, we have found that the communication size is the most informative measure. Having a large total communication size is likely to be detrimental to performance as the risk of tasks left waiting for remote data increases as well as the risk of congestion of messages. A square process grid is the factor that has the largest impact.
Regarding the block sizes, having a large B improves the load balance, but increases the amount of communication as well as the number of messages (another indicator that is not shown in the graphics).


“…Other Approaches. Muddukrishna et al (2016) use a locality aware runtime and user annotations to distribute data to different NUMA nodes. They introduce work-stealing and work-dealing algorithms that take queue sizes and node distance into account before stealing (dealing) tasks from (to) other nodes.…”

Section: Related Work (mentioning)

confidence: 99%

Blaze-Tasks

Pirkelbauer, Wilson, Peterson, et al. 2018

ACM Trans. Archit. Code Optim.


Compared to threads, tasks are a more fine-grained alternative. The task parallel programming model offers benefits in terms of better performance portability and better load-balancing for problems that exhibit nonuniform workloads. A common scenario of task parallel programming is that a task is recursively decomposed into smaller sub-tasks. Depending on the problem domain, the number of created sub-tasks may be nonuniform, thereby creating potential for significant load imbalances in the system. Dynamic load-balancing mechanisms will distribute the tasks across available threads. The final result of a computation may be modeled as a reduction over the results of all sub-tasks. This article describes a simple, yet effective prototype framework, Blaze-Tasks, for task scheduling and task reductions on shared memory architectures. The framework has been designed with lock-free techniques and generic programming principles in mind. Blaze-Tasks is implemented entirely in C++17 and is thus portable. To load-balance the computation, Blaze-Tasks uses task stealing. To manage contention on a task pool, the number of lock-free attempts to steal a task depends on the distance between thief and pool owner and the estimated number of tasks in a victim's pool. This article evaluates the Blaze framework on Intel and IBM dual-socket systems using nine benchmarks and compares its performance with other task parallel frameworks. While Cilk outperforms Blaze on Intel on most benchmarks, the evaluation shows that Blaze is competitive with OpenMP and other library-based implementations. On IBM, the experiments show that Blaze outperforms other approaches on most benchmarks.


“…A common approach is to build data locality aware compilers [11], e.g. locality aware scheduling of OpenMP tasks on multicore CPUs [9] and mapping nested access patterns on GPUs [8]. Minimising cache misses involves profiling cache traces, moreover trading function inlining with executable size, and managing memory pressure.…”

Section: Data Locality (mentioning)

confidence: 99%

A Dataflow IR for Memory Efficient RIPL Compilation to FPGAs

Stewart, Michaelson, Bhowmik, et al. 2016

Algorithms and Architectures for Parallel Processing


Abstract. Field programmable gate arrays (FPGAs) are fundamentally different to fixed processor architectures because their memory hierarchies can be tailored to the needs of an algorithm. FPGA compilers for high level languages are not hindered by fixed memory hierarchies. The constraint when compiling to FPGAs is the availability of resources. In this paper we describe how the dataflow intermediary of our declarative FPGA image processing DSL called RIPL enables us to constrain memory. We use five benchmarks to demonstrate that memory use with RIPL is comparable to the Vivado HLS OpenCV library without the need for language pragmas to guide hardware synthesis. The benchmarks also show that RIPL is more expressive than the Darkroom FPGA image processing language.
