Dynamic parallelism for simple and efficient GPU graph algorithms

Zhang, Peter; Holk, Eric; Matty, John; Misurda, Samantha; Zalewski, Marcin; Chu, Jonathan; McMillan, Scott; Lumsdaine, Andrew

doi:10.1145/2833179.2833189

Cited by 15 publications

(15 citation statements)

References 10 publications


“…Concerning the programmability, our results contrast with the results of Zhang et al, where the use of CDP simplified the development of GPU‐based graph algorithms. According to our experience, using CDP is challenging and brings complexity to the code.…”

Section: Discussion (mentioning)

confidence: 99%

“…Although results show speedups up to 2.73×, the use of CDP causes a slowdown on the overall performance of the benchmark algorithms. CDP‐based algorithms for breadth‐first search (BFS) and single‐source shortest path (SSSP) are presented in . According to the authors, CDP can simplify the development of GPU‐based graph algorithms, because the use of CDP leads to a simpler code closer to its high‐level description.…”

Section: Background and Related Work (mentioning)

confidence: 99%

“…20 CDP has also been used for processing irregular applications, such as graph algorithms, clustering and simulations. 21,22 In particular, 21 proposes a strategy that launches new grids when a kernel is able to find a predetermined and regular load during its execution. Although results show speedups up to 2.73×, the use of CDP causes a slowdown on the overall performance of the benchmark algorithms.…”

Section: CUDA Dynamic Parallelism (mentioning)

confidence: 99%


GPU‐accelerated backtracking using CUDA Dynamic Parallelism

Pessoa

Gmys

Júnior

et al. 2017

Concurrency and Computation


Summary: New GPGPU technologies, such as CUDA Dynamic Parallelism (CDP), can help deal with recursive patterns of computation, such as the divide‐and‐conquer pattern used by backtracking algorithms. In this paper, we propose a GPU‐accelerated backtracking algorithm using CDP that extends a well‐known parallel backtracking model. The search starts on the CPU, processing the search tree until a first cutoff depth. Based on this partial backtracking tree, the algorithm analyzes the memory requirements of subsequent kernel generations. The proposed algorithm performs no dynamic allocation of memory on the GPU, unlike related works from the literature. The proposed algorithm has been extensively tested using the N‐Queens Puzzle problem and instances of the Asymmetric Traveling Salesman Problem (ATSP) as test cases. The proposed CDP algorithm may, under some conditions, outperform its non‐CDP counterpart by a factor of up to 25, but it may also be up to two times slower. The CDP‐based implementation has much better worst‐case execution times and makes the algorithm's performance less dependent on parameter tuning. Compared to other CDP‐based strategies from the literature, the proposed algorithm is on average 8× faster. The proposed algorithm is also hybridized with another CDP‐based strategy from the literature. The combination of strategies is on average 4.5× faster than the related strategy. We also identify some difficulties, limitations, and bottlenecks of the CDP programming model, which may help potential users.
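The first phase described in this summary (searching on the CPU up to a cutoff depth, then sizing the work handed to the GPU) can be illustrated with a small host-only sketch for N-Queens. This is not the authors' implementation: it runs entirely in Python, the names `frontier_at_cutoff`, `count_completions`, and `estimate_bytes` are hypothetical, and the memory estimate (one board of `n` cells per frontier node) is an assumption made only for illustration.

```python
from typing import List, Tuple


def is_safe(partial: Tuple[int, ...], col: int) -> bool:
    """Return True if a queen can be placed in the next row at column `col`."""
    row = len(partial)
    for r, c in enumerate(partial):
        # Same column or same diagonal as an already placed queen.
        if c == col or abs(c - col) == row - r:
            return False
    return True


def frontier_at_cutoff(n: int, cutoff: int) -> List[Tuple[int, ...]]:
    """Host-side phase: enumerate all valid partial placements of depth `cutoff`."""
    frontier: List[Tuple[int, ...]] = [()]
    for _ in range(cutoff):
        frontier = [p + (c,) for p in frontier for c in range(n) if is_safe(p, c)]
    return frontier


def count_completions(n: int, partial: Tuple[int, ...]) -> int:
    """Backtracking from one frontier node (a CPU stand-in for the device-side search)."""
    if len(partial) == n:
        return 1
    return sum(
        count_completions(n, partial + (c,)) for c in range(n) if is_safe(partial, c)
    )


def estimate_bytes(frontier_size: int, n: int, bytes_per_cell: int = 1) -> int:
    """Hypothetical memory estimate: one board of n cells per frontier node."""
    return frontier_size * n * bytes_per_cell


if __name__ == "__main__":
    n, cutoff = 6, 2
    frontier = frontier_at_cutoff(n, cutoff)
    total = sum(count_completions(n, p) for p in frontier)
    print("frontier nodes:", len(frontier))
    print("estimated bytes:", estimate_bytes(len(frontier), n))
    print("solutions:", total)
```

Because the frontier size is known before any device work starts, the buffers for the next generation can be sized up front, which is the property that lets an approach like the one summarized above avoid dynamic allocation on the GPU.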


“…The parallelization of irregular applications using CDP has received little attention in the literature. Particularly, CDP has been used for processing graphs, clustering, simulations, and backtracking algorithms [1,7,11,13,14]. According to related works, CDP is beneficial for processing applications whose data are hierarchically arranged.…”

Section: Related Work on CUDA Dynamic Parallelism (mentioning)

confidence: 99%

Dynamic Configuration of CUDA Runtime Variables for CDP-Based Divide-and-Conquer Algorithms

Carneiro

Gmys

Melab

et al. 2019

Lecture Notes in Computer Science


CUDA Dynamic Parallelism (CDP) is an extension of the GPGPU programming model proposed to better address irregular applications and recursive patterns of computation. However, processing memory demanding problems by using CDP is not straightforward, because of its particular memory organization. This work presents an algorithm to deal with such an issue. It dynamically calculates and configures the CDP runtime variables and the GPU heap on the basis of an analysis of the partial backtracking tree. The proposed algorithm was implemented for solving permutation combinatorial problems and experimented on two test-cases: N-Queens and the Asymmetric Travelling Salesman Problem. The proposed algorithm allows different CDP-based backtracking from the literature to solve memory demanding problems, adaptively with respect to the number of recursive kernel generations and the presence of dynamic allocations on GPU.


“…Tests were performed on an NVIDIA K20c GPU. In paper [30] authors analyzed performance of a DP-enabled algorithms for breadth-first search (BFS) and single-source shortest paths (SSSP) algorithms compared to other existing implementations showing performance better than some but not the best (compared to algorithms with advanced queueing for SSSP) results. Authors of paper [15] state that they obtained over 2.6 speedups for SSSP and over 1.4 for sparse matrix-vector multiplication (SpMV) codes compared to basic implementations without DP.…”

Section: Dynamic Parallelism (mentioning)

confidence: 99%

Performance evaluation of unified memory and dynamic parallelism for selected parallel CUDA applications

Jarząbek

Czarnul

2017

J Supercomput


The aim of this paper is to evaluate the performance of two new CUDA mechanisms, unified memory and dynamic parallelism, for real parallel applications compared to standard CUDA API versions. In order to gain insight into the performance of these mechanisms, we decided to implement three applications with control and data flow typical of SPMD, geometric SPMD and divide-and-conquer schemes, which were then used for tests and experiments. Specifically, the tested applications include verification of Goldbach's conjecture, 2D heat transfer simulation and adaptive numerical integration. We experimented with various ways in which dynamic parallelism can be deployed into an existing implementation and optimized further. Subsequently, we compared the best dynamic parallelism and unified memory versions to their respective standard API counterparts. Dynamic parallelism improved performance for the heat simulation; for numerical integration it was better than the static version but worse than the iterative one; and for Goldbach's conjecture verification it gave worse results. In most cases, unified memory results in a decrease in performance. On the other hand, both mechanisms can contribute to simpler and more readable code. For dynamic parallelism, this applies to algorithms to which it can be naturally applied. Unified memory generally makes it easier for a programmer to enter the CUDA programming paradigm, as it resembles the traditional memory allocation/usage pattern.
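To make the divide-and-conquer case concrete, here is a minimal CPU-only sketch of adaptive numerical integration using adaptive Simpson's rule. It is an assumption-laden illustration, not the code evaluated in the paper: the function names, the tolerance handling, and the choice of Simpson's rule are all hypothetical. Each recursive call on a subinterval marks the point where a dynamic-parallelism version could launch a child kernel instead of recursing.

```python
import math
from typing import Callable


def simpson(f: Callable[[float], float], a: float, b: float) -> float:
    """Simpson's rule on the single interval [a, b]."""
    m = 0.5 * (a + b)
    return (b - a) / 6.0 * (f(a) + 4.0 * f(m) + f(b))


def adaptive_integrate(
    f: Callable[[float], float],
    a: float,
    b: float,
    tol: float = 1e-8,
    depth: int = 0,
    max_depth: int = 40,
) -> float:
    """Refine only the subintervals whose local error estimate exceeds `tol`."""
    m = 0.5 * (a + b)
    whole = simpson(f, a, b)
    left = simpson(f, a, m)
    right = simpson(f, m, b)
    diff = left + right - whole
    if depth >= max_depth or abs(diff) <= 15.0 * tol:
        # Accept this interval, with the usual Richardson correction term.
        return left + right + diff / 15.0
    # Irregular work: each half is refined independently of the other.
    return adaptive_integrate(f, a, m, tol / 2.0, depth + 1, max_depth) + adaptive_integrate(
        f, m, b, tol / 2.0, depth + 1, max_depth
    )


if __name__ == "__main__":
    print(adaptive_integrate(math.sin, 0.0, math.pi))
```

The amount of refinement depends on the integrand and differs between subintervals, which is why this kind of algorithm is a natural candidate for launching work from the device as it is discovered.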

