Getting performance on high-end heterogeneous nodes is challenging. This is due to the large semantic gap between a computation's specification (possibly mathematical formulas or an abstract sequential algorithm) and its parallel implementation; this gap obscures the program's parallel structure and how it gains or loses performance. We present Hedgehog, a library aimed at coarse-grain parallelism. It explicitly embeds a data-flow graph in a program and uses this graph at runtime to drive the program's execution so that it takes advantage of hardware parallelism (multicore CPUs and multiple accelerators). Hedgehog has asynchronicity built in. It statically binds individual threads to graph nodes, which are ready to fire when any of their inputs are available. This allows Hedgehog to avoid a global scheduler and the performance loss associated with global synchronizations and the management of thread pools. Hedgehog provides a separation of concerns, distinguishing between compute tasks and state-maintenance tasks. Its API reflects this separation and allows a developer to gain a better understanding of performance when executing the graph. Hedgehog is implemented as a C++17 header-only library. One feature of the framework is its low overhead: it transfers control of data between two nodes in ≈1 µs. This low overhead combines with Hedgehog's API to provide essentially cost-free profiling of the graph, thereby enabling experimentation for performance and enhancing a developer's insight into a program's behavior. Hedgehog's asynchronous data-flow graph supports a data-streaming programming model both within and between graphs.
We demonstrate the effectiveness of this approach with streaming implementations of two numerical linear algebra routines whose performance is comparable to existing libraries: matrix multiplication achieves >95% of the theoretical peak on 4 GPUs, and LU decomposition with partial pivoting starts streaming partial final result blocks 40× earlier than waiting for the full result would allow. The relative ease and understandability of obtaining performance with Hedgehog promises to enable non-specialists to target performance on high-end single nodes.
We have developed streaming implementations of two numerical linear algebra operations that further exploit the block-decomposition strategies commonly used in these operations to obtain performance. The implementations formulate algorithms as data-flow graphs and use coarse-grained parallelism to (1) emit a block of the result matrix as soon as it becomes available and (2) compute on multiple blocks in parallel. This streaming design benefits data-flow graphs consisting of multiple linear algebra operations because it removes synchronization points between successive operations: a result block from one operation can be used immediately by its successor operations without waiting for the first operation's full result. Early comparisons with OpenBLAS functions on CPUs show comparable performance when computing with large dense matrices, and an earliest arrival time for a result block that is up to 50× smaller than the time needed for the full result. More thorough studies could quantify the impact of such implementations on overall system performance when chaining multiple linear algebra operations.