Topology-Aware and Dependence-Aware Scheduling and Memory Allocation for Task-Parallel Languages

2014 · DOI: 10.1145/2641764

Abstract: We present a joint scheduling and memory allocation algorithm for efficient execution of task-parallel programs on non-uniform memory architecture (NUMA) systems. Task and data placement decisions are based on a static description of the memory hierarchy and on runtime information about inter-task communication. Existing locality-aware scheduling strategies for fine-grained tasks have strong limitations: they are specific to some class of machines or applications, they do not handle task dependences, they requi…
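The abstract's central idea, placing a task according to a static topology description plus runtime information about where its inputs live, can be illustrated with a minimal cost heuristic. This is a sketch under assumptions, not the paper's actual algorithm: the distance matrix, per-node byte counts, and all function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Static description of the memory hierarchy: dist[i][j] is the access
// cost factor between NUMA nodes i and j (in the convention reported by
// `numactl --hardware`: 10 = local, 20 or more = remote).
using DistanceMatrix = std::vector<std::vector<int>>;

// Runtime information: how many bytes of the task's inputs reside on each node.
using InputBytesPerNode = std::vector<std::size_t>;

// Hypothetical placement heuristic: pick the node where executing the task
// minimizes the distance-weighted volume of its input accesses.
int pick_execution_node(const DistanceMatrix& dist,
                        const InputBytesPerNode& input_bytes) {
    int best_node = 0;
    std::uint64_t best_cost = UINT64_MAX;
    for (std::size_t cand = 0; cand < dist.size(); ++cand) {
        std::uint64_t cost = 0;
        for (std::size_t src = 0; src < input_bytes.size(); ++src)
            cost += static_cast<std::uint64_t>(input_bytes[src]) * dist[cand][src];
        if (cost < best_cost) {
            best_cost = cost;
            best_node = static_cast<int>(cand);
        }
    }
    return best_node;
}
```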

Cited by 36 publications, with 32 citation statements. References 33 publications.
“…In earlier work [14], we showed that some of these issues can be mitigated by using work-pushing. Similar to the abstract model discussed above, the approach assumes that tasks communicate through task-private buffers.…”
Section: Weaknesses of Task Parallelism on NUMA Systems (mentioning)
confidence: 94%
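As a rough illustration of the work-pushing idea quoted above, the sketch below pushes a ready task to a worker on the NUMA node that holds its task-private input buffer, rather than keeping it on the producer's own queue. All names here (Task, Worker, push_task) are hypothetical, and the mutex-guarded inbox is a simplification: the runtime described in [14] uses a lock-free queue instead (see the sketch further below).

```cpp
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

struct Task {
    int input_node;            // NUMA node holding this task's private input buffer
    // ... task body and metadata omitted ...
};

struct Worker {
    int numa_node;             // node of the core this worker is pinned to
    std::mutex inbox_lock;     // simplification of the lock-free MPSC queue
    std::deque<Task*> inbox;   // tasks pushed here by remote workers
};

// Push a ready task to some worker on the node that owns its inputs;
// fall back to worker 0 if no worker runs on that node.
void push_task(std::vector<Worker>& workers, Task* t) {
    for (Worker& w : workers) {
        if (w.numa_node == t->input_node) {
            std::lock_guard<std::mutex> g(w.inbox_lock);
            w.inbox.push_back(t);
            return;
        }
    }
    std::lock_guard<std::mutex> g(workers[0].inbox_lock);
    workers[0].inbox.push_back(t);
}
```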
“…If work-pushing is enabled, workers can also receive tasks in a dedicated multi-producer single-consumer queue [14]. Our experiments use one worker thread per core.…”
Section: Software Environment (mentioning)
confidence: 99%
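The dedicated multi-producer single-consumer inbox mentioned in this excerpt can be sketched with a classic intrusive MPSC queue in the style popularized by Dmitry Vyukov. This is a generic textbook design, not the cited runtime's actual implementation; any worker may push, and only the owning worker pops.

```cpp
#include <atomic>

struct TaskNode {
    std::atomic<TaskNode*> next{nullptr};
    // ... task payload would go here ...
};

class MpscInbox {
    std::atomic<TaskNode*> head_;  // producers exchange new nodes in here
    TaskNode*              tail_;  // touched only by the single consumer
    TaskNode               stub_;  // dummy node: the list is never structurally empty
public:
    MpscInbox() : head_(&stub_), tail_(&stub_) {}

    // Multi-producer side: a single atomic exchange, callable from any worker.
    void push(TaskNode* n) {
        n->next.store(nullptr, std::memory_order_relaxed);
        TaskNode* prev = head_.exchange(n, std::memory_order_acq_rel);
        prev->next.store(n, std::memory_order_release);
    }

    // Single-consumer side: only the worker owning this inbox calls pop.
    TaskNode* pop() {
        TaskNode* tail = tail_;
        TaskNode* next = tail->next.load(std::memory_order_acquire);
        if (tail == &stub_) {                       // skip over the dummy node
            if (next == nullptr) return nullptr;    // queue empty
            tail_ = next;
            tail  = next;
            next  = next->next.load(std::memory_order_acquire);
        }
        if (next != nullptr) { tail_ = next; return tail; }
        if (tail != head_.load(std::memory_order_acquire))
            return nullptr;                         // a producer is mid-push; retry later
        push(&stub_);                               // re-insert the dummy node
        next = tail->next.load(std::memory_order_acquire);
        if (next != nullptr) { tail_ = next; return tail; }
        return nullptr;
    }
};
```

With one such inbox per worker and one worker thread per core, as the excerpt describes, pushes from remote workers never contend with the owner's local scheduling loop beyond the single exchange on the head pointer.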
“…Since the inputs of a task are the outputs of another task, the location of input data is determined when the producer task executes. This advocates for an enhanced work-pushing technique, building on the algorithm proposed by Drebes et al. (2014) and revising it to work together with deferred allocation: a task is placed according to the location of its input data before memory is allocated for its outputs. This combination of enhanced work-pushing and deferred allocation is fully automatic, application-independent, portable across NUMA machines and transparently adapts to dynamic changes at run time.…”
Section: NUMA-Aware Optimizations (mentioning)
confidence: 99%
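A hedged sketch of the deferred-allocation flow this excerpt describes, assuming Linux libnuma: the task's execution node is derived first from where its inputs already reside, and only then are its output pages bound to that node, so both the producing writes and the co-located consuming reads stay node-local. Everything except the libnuma calls is a made-up placeholder.

```cpp
#include <cstddef>
#include <numa.h>   // Linux libnuma; check numa_available(), link with -lnuma

struct InputRef  { const void* data; std::size_t size; int node; };
struct OutputRef { void* data; std::size_t size; int node; };

// Simplistic stand-in for the real placement heuristic: the node holding
// the largest share of input bytes (assumes node ids below 64).
static int choose_node_from_inputs(const InputRef* in, std::size_t n) {
    std::size_t bytes[64] = {};
    for (std::size_t i = 0; i < n; ++i) bytes[in[i].node] += in[i].size;
    int best = 0;
    for (int nd = 1; nd < 64; ++nd)
        if (bytes[nd] > bytes[best]) best = nd;
    return best;
}

// Deferred allocation: place the task from its inputs first, then bind the
// output pages to that node. numa_alloc_onnode may return nullptr on failure;
// buffers are released later with numa_free(data, size).
OutputRef alloc_output_deferred(const InputRef* inputs, std::size_t n_inputs,
                                std::size_t out_size) {
    int node = choose_node_from_inputs(inputs, n_inputs);
    void* p = numa_alloc_onnode(out_size, node);
    return OutputRef{p, out_size, node};
}
```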