2020
DOI: 10.14778/3384345.3384358
Traversing large graphs on GPUs with unified memory

Abstract: Due to the limited capacity of GPU memory, the majority of prior work on graph applications on GPUs has been restricted to graphs of modest sizes that fit in memory. Recent hardware and software advances make it possible to address much larger host memory transparently as a part of a feature known as unified virtual memory. While accessing host memory over an interconnect is understandably slower, the problem space has not been sufficiently explored in the context of a challenging workload with low computation…
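To make the mechanism the abstract refers to concrete: a minimal sketch (not the paper's own code) of level-synchronous BFS over a CSR graph allocated with `cudaMallocManaged`, so the topology arrays can be paged in from host memory on demand when the graph exceeds device capacity. The toy 4-vertex path graph, kernel name, and `cudaMemAdvise` hints are illustrative assumptions, not taken from the paper.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// One level-synchronous BFS step: each thread scans one frontier vertex.
// row_ptr/col_idx live in managed (unified) memory, so pages not resident
// on the GPU are migrated from host RAM on first touch.
__global__ void bfs_level(const int *row_ptr, const int *col_idx,
                          int *dist, int n, int level, int *changed) {
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n || dist[v] != level) return;
    for (int e = row_ptr[v]; e < row_ptr[v + 1]; ++e) {
        int u = col_idx[e];
        if (dist[u] == -1) { dist[u] = level + 1; *changed = 1; }
    }
}

int main() {
    // Toy 4-vertex path graph (0-1-2-3) in CSR form; a real workload would
    // load a graph far larger than device memory, which is the point.
    const int n = 4, m = 6;
    int *row_ptr, *col_idx, *dist, *changed;
    cudaMallocManaged(&row_ptr, (n + 1) * sizeof(int));
    cudaMallocManaged(&col_idx, m * sizeof(int));
    cudaMallocManaged(&dist, n * sizeof(int));
    cudaMallocManaged(&changed, sizeof(int));
    int rp[] = {0, 1, 3, 5, 6}, ci[] = {1, 0, 2, 1, 3, 2};
    for (int i = 0; i <= n; ++i) row_ptr[i] = rp[i];
    for (int i = 0; i < m; ++i) col_idx[i] = ci[i];
    for (int i = 0; i < n; ++i) dist[i] = -1;
    dist[0] = 0;  // source vertex

    // Hint that the topology is read-only so the driver may replicate
    // pages on the device instead of migrating them back and forth.
    cudaMemAdvise(row_ptr, (n + 1) * sizeof(int), cudaMemAdviseSetReadMostly, 0);
    cudaMemAdvise(col_idx, m * sizeof(int), cudaMemAdviseSetReadMostly, 0);

    for (int level = 0;; ++level) {
        *changed = 0;
        bfs_level<<<(n + 255) / 256, 256>>>(row_ptr, col_idx, dist, n,
                                            level, changed);
        cudaDeviceSynchronize();
        if (!*changed) break;
    }
    for (int i = 0; i < n; ++i) printf("dist[%d] = %d\n", i, dist[i]);
    cudaFree(row_ptr); cudaFree(col_idx); cudaFree(dist); cudaFree(changed);
    return 0;
}
```

The interesting performance question, which the paper studies, is how this demand-paging path behaves under BFS's irregular, low-compute access pattern once the graph oversubscribes device memory.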

Cited by 43 publications (8 citation statements)
References 46 publications
“…Evaluation of Azimov's algorithm shows that it is possible to improve performance by using GPGPU because operations of linear algebra can be efficiently implemented on GPGPU (Mishin et al, 2019;Terekhov et al, 2020). Moreover, for practical reasons, it is interesting to provide a multi-GPU version of the algorithm and to utilize unified memory, which is suitable for linear algebra based processing of out-of-GPGPU-memory data and traversing on large graphs (Chien et al, 2019;Gera et al, 2020).…”
Section: Discussion
confidence: 99%
“…We evaluate competing systems' performance on three applications, BC, LL, and NCP. To keep consistent with previous work [1,23,53], we configure the three applications as follows.…”
Section: Methods
confidence: 99%
“…The algorithm samples the starting nodes from the graph. Therefore, in our evaluation, we randomly sample a batch of 100 source vertices for each graph [23].…”
Section: Methods
confidence: 99%
“…Third, an extremely large graph, which drives the needs of graph sampling and random walk, usually goes beyond the size of GPU memory. While there exists an array of solutions for GPU-based large graph processing, namely, unified memory [26], topology-aware partition [27] and vertex-range based partitions [28], graph sampling and random walk algorithms, which require all the neighbors of a vertex to present in order to compute the selection probability, exhibit stringent requirement on the partitioning methods. In the meantime, the asynchronous and out-of-order nature of graph sampling and random walk provides some unique optimization opportunities for out-of-memory sampling, which are neither shared nor explored by traditional out-of-memory systems.…”
Section: Introduction
confidence: 99%