Optimizing ordered graph algorithms with GraphIt

Zhang, Yunming; Brahmakshatriya, Ajay; Chen, Xinyi; Dhulipala, Laxman; Kamil, Shoaib; Amarasinghe, Saman; Shun, Julian

doi:10.1145/3368826.3377909

Cited by 28 publications

(20 citation statements)

References 37 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…While Prodigy supports prefetching for indirect memory accesses, it does not account for this additional control-flow information for prefetching. Similar trends might be observed for ordered graph algorithms [28], [103] because node priority is not accounted for prefetching. In such cases, Prodigy might prefetch inaccurate vertices, and we envision using a mechanism that disables the prefetcher when it detects cache thrashing [88].…”

Section: G Limitations Of Prodigysupporting

confidence: 74%

Prodigy: Improving the Memory Latency of Data-Indirect Irregular Workloads Using Hardware-Software Co-Design

Talati

May

Behroozi

et al. 2021

2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA)

View full text Add to dashboard Cite

Irregular workloads are typically bottlenecked by the memory system. These workloads often use sparse data representations, e.g., compressed sparse row/column (CSR/CSC), to conserve space at the cost of complicated, irregular traversals. Such traversals access large volumes of data and offer little locality for caches and conventional prefetchers to exploit. This paper presents Prodigy, a low-cost hardware-software codesign solution for intelligent prefetching to improve the memory latency of several important irregular workloads. Prodigy targets irregular workloads including graph analytics, sparse linear algebra, and fluid mechanics that exhibit two specific types of datadependent memory access patterns. Prodigy adopts a "best of both worlds" approach by using static program information from software, and dynamic run-time information from hardware. The core of the system is the Data Indirection Graph (DIG)-a proposed compact representation used to express program semantics such as the layout and memory access patterns of key data structures. The DIG representation is agnostic to a particular data structure format and is demonstrated to work with several sparse formats including CSR and CSC. Program semantics are automatically captured with a compiler pass, encoded as a DIG, and inserted into the application binary. The DIG is then used to program a low-cost hardware prefetcher to fetch data according to an irregular algorithm's data structure traversal pattern. We equip the prefetcher with a flexible prefetching algorithm that maintains timeliness by dynamically adapting its prefetch distance to an application's execution pace.We evaluate the performance, energy consumption, and transistor cost of Prodigy using a variety of algorithms from the GAP, HPCG, and NAS benchmark suites. We compare the performance of Prodigy against a non-prefetching baseline as well as stateof-the-art prefetchers. We show that by using just 0.8KB of storage, Prodigy outperforms a non-prefetching baseline by 2.6× and saves energy by 1.6×, on average. Prodigy also outperforms modern data prefetchers by 1.5-2.3×.Index Terms-DRAM stalls, irregular workloads, graph processing, hardware-software co-design, programming model, programmer annotations, compiler, and hardware prefetching. Program the prefetcher Program the prefetcher Generate prefetch requests Generate prefetch requestsCompiler analysis Run application Software Hardware Instrumented application binary Application source code Add DIG representation Data Indirection Graph (DIG)

show abstract

Section: G Limitations Of Prodigysupporting

confidence: 74%

Prodigy: Improving the Memory Latency of Data-Indirect Irregular Workloads Using Hardware-Software Co-Design

Talati

May

Behroozi

et al. 2021

2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA)

View full text Add to dashboard Cite

show abstract

“…It improves the cost bounds for existing parallel SSSP algorithms such as Radius-Stepping [25] and Shi-Spencer [78]. In practice, we show simple implementations based on flat arrays, which makes our stepping algorithms outperform state-of-the-art software [11,45,71,92].…”

Section: Gapbsmentioning

confidence: 78%

“…We note that FinishCheck is not necessary for Δ-Stepping, just like other stepping algorithms. In fact, all existing implementations [11,45,71,92] relaxed FinishCheck in different ways. In this paper, we show that removing FinishCheck in Δ-Stepping (referred to as Δ * -Stepping) can lead to better bounds (Thm.…”

Section: Framework 31 the Lab-pq Abstractionmentioning

confidence: 99%

“…Despite dozens of papers and implementations over the past decades, all existing solutions have some unsatisfactory aspects. Practically, most existing parallel SSSP implementations [11,45,71,92] are based on Δ-Stepping [69], which is a hybrid of Dijkstra's algorithm [48] and the Bellman-Ford algorithm [12,51]. It determines the correct shortest distances in increments of Δ, in step settling down all the vertices with distances in [ Δ, ( + 1)Δ].…”

Section: Introductionmentioning

confidence: 99%

See 1 more Smart Citation

Optimal Parallel Algorithms in the Binary-Forking Model

Blelloch

Fineman

et al. 2020

Proceedings of the 32nd ACM Symposium on Parallelism in Algorithms and Architectures

View full text Add to dashboard Cite

In this paper we develop optimal algorithms in the binary-forking model for a variety of fundamental problems, including sorting, semisorting, list ranking, tree contraction, range minima, and ordered set union, intersection and difference. In the binary-forking model, tasks can only fork into two child tasks, but can do so recursively and asynchronously. The tasks share memory, supporting reads, writes and test-and-sets. Costs are measured in terms of work (total number of instructions), and span (longest dependence chain).The binary-forking model is meant to capture both algorithm performance and algorithm-design considerations on many existing multithreaded languages, which are also asynchronous and rely on binary forks either explicitly or under the covers. In contrast to the widely studied PRAM model, it does not assume arbitrary-way forks nor synchronous operations, both of which are hard to implement in modern hardware. While optimal PRAM algorithms are known for the problems studied herein, it turns out that arbitrary-way forking and strict synchronization are powerful, if unrealistic, capabilities. Natural simulations of these PRAM algorithms in the binary-forking model (i.e., implementations in existing parallel languages) incur an Ω(log n) overhead in span. This paper explores techniques for designing optimal algorithms when limited to binary forking and assuming asynchrony. All algorithms described in this paper are the first algorithms with optimal work and span in the binary-forking model. Most of the algorithms are simple. Many are randomized.

show abstract

“…Many graph engines and frameworks have been developed in the past decade. Some of them are sharedmemory, focusing on processing in-memory datasets within a computation node, e.g., Galois [57], Ligra [63], Polymer [76], Graph-Grind [67], GraphIt [78], and Graptor [69]. Some are distributed systems, e.g., Pregel [50], GraphLab [48], and PowerGraph [25].…”

Section: Related Workmentioning

confidence: 99%

Speed-ANN: Low-Latency and High-Accuracy Nearest Neighbor Search via Intra-Query Parallelism

Peng¹,

Zhang²,

Li³

et al. 2022

Preprint

View full text Add to dashboard Cite

Nearest Neighbor Search (NNS) has recently drawn a rapid increase of interest due to its core role in managing high-dimensional vector data in data science and AI applications. The interest is fueled by the success of neural embedding, where deep learning models transform unstructured data into semantically correlated feature vectors for data analysis, e.g., recommend popular items. Among several categories of methods for fast NNS, similarity graph is one of the most successful algorithmic trends. Several of the most popular and top-performing similarity graphs, such as NSG and HNSW, at their core employ best-first traversal along the underlying graph indices to search near neighbors. Maximizing the performance of the search is essential for many tasks, especially at the large-scale and highrecall regime. In this work, we provide an in-depth examination of the challenges of the state-of-the-art similarity search algorithms, revealing its challenges in leveraging multi-core processors to speed up the search efficiency. We also exploit whether similarity graph search is robust to deviation from maintaining strict order by allowing multiple walkers to simultaneously advance the search frontier. Based on our insights, we propose Speed-ANN , a parallel similarity search algorithm that exploits hidden intra-query parallelism and memory hierarchy that allows similarity search to take advantage of multiple CPU cores to significantly accelerate search speed while achieving high accuracy.We evaluate Speed-ANN on a wide range of datasets, ranging from million to billion data points, and show that it reduces query latency by 2.1×, 5.2×, and 13× on average than NSG and 2.1×, 6.7×, and 17.8× on average than HNSW at 0.9, 0.99, and 0.999 recall target, respectively. More interesting, our approach achieves super-linear speedups in some cases using 32 threads, achieving up to 37.7 times and 76.6 times faster to obtain the same accuracy than two state-of-the-art graph-based nearest neighbor search methods NSG and HNSW, respectively. Finally, with multicore support, we show that our approach offers faster search latency than highlyoptimized GPU implementation and provides good scalability as the increase of the number of hardware resources (e.g., CPU cores) and graph sizes, offering up to 16.0× speedup on two billion-scale datasets.

show abstract

Optimizing ordered graph algorithms with GraphIt

Cited by 28 publications

References 37 publications

Prodigy: Improving the Memory Latency of Data-Indirect Irregular Workloads Using Hardware-Software Co-Design

Prodigy: Improving the Memory Latency of Data-Indirect Irregular Workloads Using Hardware-Software Co-Design

Optimal Parallel Algorithms in the Binary-Forking Model

Speed-ANN: Low-Latency and High-Accuracy Nearest Neighbor Search via Intra-Query Parallelism

Contact Info

Product

Resources

About