Dynamic Tracing: Memoization of Task Graphs for Dynamic Task-Based Runtimes

Lee, Wonchan; Slaughter, Elliott; Bauer, Michael; Treichler, Sean; Warszawski, Todd; Garland, Michael; Aiken, Alex

doi:10.1109/sc.2018.00037

Cited by 17 publications

(8 citation statements)

References 21 publications

“…Both LB and WS approaches have their advantages. Notably, WS is often applied in task-parallel Runtime Systems (RTSs) [14], and shared memory scenarios [15,16] (even though it is distributed in nature), although it has been used for highly unpredictable applications in distributed memory as well [17].…”

Section: Related Work (mentioning)

confidence: 99%

PackStealLB: A scalable distributed load balancer based on work stealing and workload discretization

Freitas, Pilla, Santana, et al. 2021

Journal of Parallel and Distributed Computing

The scalability of high-performance, parallel iterative applications is directly affected by how well they use the available computing resources. These applications are subject to load imbalance due to the nature and dynamics of their computations. It is common for high-performance systems to employ periodic load balancing to tackle this issue. Dynamic load balancing algorithms redistribute the application's workload using heuristics to circumvent the NP-hard complexity of the problem. However, scheduling heuristics must be fast to avoid hindering application performance when distributing the workload on large and distributed environments. In this work, we present a technique for low-overhead, high-quality scheduling decisions for parallel iterative applications. The technique relies on combined application workload information paired with distributed scheduling algorithms. An initial distributed step among scheduling agents groups application tasks in packs of similar load to minimize messages among them. This information is used by our scheduling algorithm, PackStealLB, for its distributed-memory work stealing heuristic. Experimental results showed that PackStealLB is able to improve the performance of a molecular dynamics benchmark by up to 41%, outperforming other scheduling algorithms in most scenarios over almost one thousand cores.

“…Dynamic traces have seen some limited use as an extension to JIT. If kernels are labeled in advance, a minimal dynamic trace that primarily stores address information can be used to detect dependencies [21]. This currently relies upon having the kernel labels available in advance, and without this information.…”

Section: Background and Motivation (mentioning)

confidence: 99%

Automated Parallel Kernel Extraction from Dynamic Application Traces

Uhrie, Chakrabarti, Brunhaver 2020

Preprint

Modern program runtime is dominated by segments of repeating code called kernels. Kernels are accelerated by increasing memory locality, increasing data-parallelism, and exploiting producer-consumer parallelism among kernels, which requires hardware specialized for a particular class of kernels. Programming this hardware can be difficult, requiring that the kernels be identified and annotated in the code or translated to a domain-specific language. This paper describes a technique to automatically localize parallel kernels from a dynamic application trace, facilitating further code optimization. Dynamic trace collection is fast and compact. With optimization, it only incurs a time-dilation of a factor of nine and a file size of one megabyte per second, addressing a significant criticism of this approach. Kernel extraction is accurate and performed in linear time within logarithmic memory, detecting a wide range of kernels. This approach was validated across 16 libraries, comprising 10,507 kernel instances. To validate the accuracy of our detected kernels, five test programs were written that span traditional kernel definitions and were certified to contain all the kernels that were expected.

“…TPS results are difficult to interpret and apply, because efficiency (and thus the amount of useful work) is not constrained. With empty tasks [28], the resulting upper bound on task scheduling throughput fails to represent useful work within a realistic application. With non-empty tasks, since the efficiency of the overall application is typically not reported [3,6], TPS is not a measurement of runtime-limited performance.…”

Section: METG (mentioning)

confidence: 99%

“…Intuitively, for a given workload, METG(50%) is the smallest task granularity that maintains at least 50% efficiency, meaning that the application achieves at least 50% of the highest performance (in FLOP/s, B/s, or other application-specific measure) achieved on a given machine. The efficiency bound in METG is a key innovation over previous approaches, such as tasks per second (TPS), that fail to consider the amount of useful work performed (if tasks are non-empty [3,6]) or to perform useful work at all (if tasks are empty [28]).…”

Section: Introduction (mentioning)

confidence: 99%

Task Bench: A Parameterized Benchmark for Evaluating Parallel Runtime Performance

Slaughter, Wu, et al. 2019

Preprint

Self Cite

We present Task Bench, a parameterized benchmark designed to explore the performance of parallel and distributed programming systems under a variety of application scenarios. Task Bench lowers the barrier to benchmarking multiple programming systems by making the implementation for a given system orthogonal to the benchmarks themselves: every benchmark constructed with Task Bench runs on every Task Bench implementation. Furthermore, Task Bench's parameterization enables a wide variety of benchmark scenarios that distill the key characteristics of larger applications. We conduct a comprehensive study with implementations of Task Bench in 15 programming systems on up to 256 Haswell nodes of the Cori supercomputer. We introduce a novel metric, minimum effective task granularity, to study the baseline runtime overhead of each system. We show that when running at scale, 100 µs is the smallest granularity that even the most efficient systems can reliably support with current technologies. We also study each system's scalability and its ability to hide communication and mitigate load imbalance.
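The citation statement above defines METG(50%) as the smallest task granularity that still achieves at least 50% of the highest performance observed on a machine. The following is a minimal sketch of that definition only, not Task Bench's actual implementation: the function name `metg`, the sample format, and the measurements are all hypothetical, and it simply picks the smallest measured granularity that meets the efficiency bound, with no interpolation between samples.

```python
def metg(samples, threshold=0.5):
    """Return the smallest task granularity whose efficiency meets `threshold`.

    `samples` is a hypothetical list of (granularity_seconds, performance)
    pairs, where performance is any application-specific rate (e.g. FLOP/s).
    Efficiency is measured relative to the best performance in `samples`.
    """
    if not samples:
        raise ValueError("need at least one measurement")
    peak = max(perf for _, perf in samples)
    if peak <= 0:
        raise ValueError("peak performance must be positive")
    # Keep only granularities that reach the required fraction of peak.
    efficient = [g for g, perf in samples if perf >= threshold * peak]
    return min(efficient)


# Made-up measurements: performance drops as tasks get smaller.
samples = [(1e-2, 100.0), (1e-3, 95.0), (1e-4, 60.0), (1e-5, 20.0), (1e-6, 3.0)]
print(metg(samples))        # smallest granularity with >= 50% efficiency
print(metg(samples, 0.9))   # stricter bound: >= 90% efficiency
```

With these made-up numbers the 50% bound is met down to the 100 µs sample, while the 90% bound is only met down to the 1 ms sample, which shows how a tighter efficiency bound raises the reported granularity.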