Proceedings of the 27th International ACM Conference on International Conference on Supercomputing 2013
DOI: 10.1145/2464996.2479870
Efficient scheduling of recursive control flow on GPUs

Abstract: Graphics processing units (GPUs) have rapidly emerged as a very significant player in high performance computing. Single instruction multiple thread (SIMT) pipelines are typically used in GPUs to exploit parallelism and maximize performance. Although support for unstructured control flow has been included in GPUs, efficiently managing thread divergence for arbitrary parallel programs remains a critical challenge. In this paper, we focus on the problem of supporting recursion in modern GPUs. We design and compa…

Cited by 11 publications (10 citation statements)
References 27 publications
“…Meng et al [12], Fung et al [32], Fung et al [31], and Brunie et al [20] proposed several warp subdivision techniques to improve resource utilization under warp divergence. Huo et al [34] designed scheduling algorithms for recursive control flow on GPGPUs. Jablin et al [10] reorganized the instruction order to reduce the divergence time.…”
Section: Related Work
confidence: 99%
“…4 Vectorization was performed as described in Section 5. The benchmarks are: (1) knapsack, which computes the optimal solution to the knapsack problem [6] 5 ; (2) fib, which computes the 45-th Fibonacci number [6]; (3) parentheses, which computes the number of well-formed parentheses string combinations with 19 parentheses; (4) nqueens, which counts the number of valid solutions to the 13-queens problem [2]; (5) graphcol, which counts the number of valid ways of coloring a 38-node, 64-edge graph with three colors [17]; (6) uts, which counts the number of nodes in a probabilistic binomial tree [27]; (7) binomial, which recursively computes the combination 36C13 [17]; and (8) minmax, a min-max search for tic-tac-toe on a 4 × 4 board. Table 1 characterizes the benchmarks and their sequential execution time.…”
Section: Evaluation Platform and Benchmarks
confidence: 99%
“…If one thread in a warp makes a method call, all other threads will wait until the call returns before proceeding; as recursive calls can lead to long call chains, divergence can substantially decrease warp-level parallelism [8]. In contrast, autorope-enabled traversal algorithms do not suffer significant divergence: because the recursive method is translated into a loop over a stack, control immediately re-converges at the top of the loop, even as the threads diverge in the tree.…”
Section: Memory Coalescing and Thread Divergence
confidence: 99%
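The recursion-to-loop transformation quoted above can be sketched on the CPU side. This is a minimal illustration, not the cited implementation: the `Node` class, function names, and tree values are all invented for the example. The point is that the recursive form has a divergence point at each call site, while the stack-based form has a single loop head where control re-converges on every iteration.

```python
# Sketch of the recursion-to-stack rewrite ("autorope"-style) described above:
# a recursive tree traversal becomes a loop over an explicit stack, so control
# re-converges at the top of the loop even as threads diverge in the tree.

class Node:
    def __init__(self, value, left=None, right=None):
        self.value = value
        self.left = left
        self.right = right

def sum_recursive(node):
    """Recursive form: each call is a potential divergence point."""
    if node is None:
        return 0
    return node.value + sum_recursive(node.left) + sum_recursive(node.right)

def sum_stack(root):
    """Loop-over-a-stack form: one loop head where control re-converges."""
    total, stack = 0, [root]
    while stack:                      # re-convergence point on every iteration
        node = stack.pop()
        if node is None:
            continue
        total += node.value
        stack.append(node.left)
        stack.append(node.right)
    return total

tree = Node(1, Node(2, Node(4)), Node(3))
assert sum_recursive(tree) == sum_stack(tree) == 10
```

Both forms compute the same result; the difference that matters on a SIMT pipeline is only where control flow can re-converge.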
“…Then, when the warp's traversal returns to the tree node which the truncated point would have visited next, it is unmasked, and resumes its computation. Essentially, lockstep traversal forces autorope implementations to implement the same thread divergence behavior the GPU naturally provides for recursive implementations [8].…”
Section: Overview of Lockstep Traversal
confidence: 99%
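The masking-and-resume behavior quoted above can be simulated in a few lines. This is a toy model under invented assumptions, not the cited system: all lanes of a "warp" step through the same preorder node sequence together; a lane that truncates a subtree is masked until the warp's traversal moves past that subtree, at which point it is unmasked and resumes. Node values, thresholds, and the `subtree_end` encoding are illustrative.

```python
# Toy simulation of lockstep traversal with per-lane masking.
# values is a preorder walk of a small binary tree; subtree_end[i] is the
# index just past node i's subtree in that preorder sequence.
values      = [5, 3, 1, 4, 8, 7, 9]
subtree_end = [7, 4, 3, 4, 7, 6, 7]

def lockstep_visit_counts(thresholds):
    """Each lane truncates any subtree rooted at a value above its threshold.

    Returns how many nodes each lane actually visits (i.e., is unmasked for).
    """
    resume_at = [0] * len(thresholds)   # index where a masked lane wakes up
    visits = [0] * len(thresholds)
    for i, v in enumerate(values):      # the warp advances in lockstep
        for lane, t in enumerate(thresholds):
            if i < resume_at[lane]:     # lane is masked: skip this node
                continue
            visits[lane] += 1
            if v > t:                   # truncate: mask until subtree ends
                resume_at[lane] = subtree_end[i]
    return visits

# A lane with a loose threshold visits every node; a lane that truncates
# at the root visits only one.
print(lockstep_visit_counts([10, 4, 7]))
```

The outer loop is the single shared traversal; the per-lane mask (`resume_at`) is the only divergent state, which is exactly the divergence behavior the quoted passage says lockstep traversal reproduces.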