Efficient scheduling of recursive control flow on GPUs

Huo, Xin; Krishnamoorthy, Sriram; Agrawal, Gagan

doi:10.1145/2464996.2479870

Cited by 11 publications

(10 citation statements)

References 27 publications


“…Meng et al [12], Fung et al [32], Fung et al [31], and Brunie et al [20] proposed several warp subdivision technologies to improve the resource utilities in the warp divergence. Huo et al [34] designed scheduling algorithms for recursive control flow on GPGPUs. Jablin et al [10] reorganized the instruction order to reduce the divergence time.…”

Section: Related Work (mentioning)

confidence: 99%

A Stall-Aware Warp Scheduling for Dynamically Optimizing Thread-level Parallelism in GPGPUs

Xiao et al. 2015

Proceedings of the 29th ACM on International Conference on Supercomputing


General-Purpose Graphic Processing Units (GPGPU) have been widely used in high performance computing as application accelerators due to their massive parallelism and high throughput. A GPGPU generally contains two layers of schedulers, a cooperative-thread-array (CTA) scheduler and a warp scheduler, which administer the thread level parallelism (TLP). Previous research shows the maximized TLP does not always deliver the optimal performance. Unfortunately, existing warp scheduling schemes do not optimize TLP at runtime, which is impossible to fit various access patterns for diverse applications. Dynamic TLP optimization in the warp scheduler remains a challenge to exploit the GPGPU highly-parallel compute power. In this paper, we comprehensively investigate the TLP performance impact in the warp scheduler. Based on our analysis of the pipeline efficiency, we propose a Stall-Aware Warp Scheduling (SAWS), which optimizes the TLP according to the pipeline stalls. SAWS adds two modules to the original scheduler to dynamically adjust TLP at runtime. A trigger-based method is employed for a fast tuning response. We simulated SAWS and conducted extensive experiments on GPGPU-Sim using 21 paradigmatic benchmarks. Our numerical results show that SAWS effectively improves the pipeline efficiency by reducing the structural hazards without causing extra data hazards. SAWS achieves an average speedup of 14.7% with a geometric mean, even higher than existing Two-Level scheduling scheme with the optimal fetch group sizes over a wide range of benchmarks. More importantly, compared with the dynamic TLP optimization in the CTA scheduling, SAWS still has 9.3% performance improvement among the benchmarks, which shows that it is a competitive choice by moving dynamic TLP optimization from the CTA to warp scheduler.


“…Vectorization was performed as described in Section 5. The benchmarks are: (1) knapsack, which computes the optimal solution to the knapsack problem [6]; (2) fib, which computes the 45-th Fibonacci number [6]; (3) parentheses, which computes the number of well-formed parentheses string combinations with 19 parentheses; (4) nqueens, which counts the number of valid solutions to the 13-queens problems [2]; (5) graphcol, which counts the number of valid ways of coloring a 38-node, 64-edge graph with three colors [17]; (6) uts, which counts the number of nodes in a probabilistic binomial tree [27]; (7) binomial, which recursively computes the combination 36C13 [17]; and (8) minmax, a min-max search for tic-tac-toe on a 4 × 4 board. Table 1 characterizes the benchmarks and their sequential execution time.…”

Section: Evaluation Platform and Benchmarks (mentioning)

confidence: 99%

Efficient execution of recursive programs on commodity vector hardware

Ren, Krishnamoorthy, et al. 2015

Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation

Self Cite


The pursuit of computational efficiency has led to the proliferation of throughput-oriented hardware, from GPUs to increasingly wide vector units on commodity processors and accelerators. This hardware is designed to efficiently execute data-parallel computations in a vectorized manner. However, many algorithms are more naturally expressed as divide-and-conquer, recursive, task-parallel computations. In the absence of data parallelism, it seems that such algorithms are not well suited to throughput-oriented architectures. This paper presents a set of novel code transformations that expose the data parallelism latent in recursive, task-parallel programs. These transformations facilitate straightforward vectorization of task-parallel programs on commodity hardware. We also present scheduling policies that maintain high utilization of vector resources while limiting space usage. Across several task-parallel benchmarks, we demonstrate both efficient vector resource utilization and substantial speedup on chips using Intel's SSE4.2 vector units, as well as accelerators using Intel's AVX512 units.
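The idea of exposing data parallelism in a recursive computation can be illustrated with a small sketch. This is not the transformation from the cited paper; it is a simplified, hypothetical illustration in plain Python (the function names `fib_recursive` and `fib_by_levels` are invented here). Instead of following one call chain at a time, all pending calls at the same recursion depth are held in one list, so the same operation is applied to every element of that list, which is the shape a vector unit or a GPU warp can exploit.

```python
def fib_recursive(n):
    """Ordinary recursive definition, one call chain at a time."""
    return n if n < 2 else fib_recursive(n - 1) + fib_recursive(n - 2)


def fib_by_levels(n):
    """Evaluate the same recursion one depth level at a time.

    `frontier` holds every pending call at the current depth. Each step
    applies one uniform operation to the whole list, which is where the
    data parallelism comes from (assumption: illustration only).
    """
    frontier = [n]
    total = 0
    while frontier:
        # Base cases (0 or 1) contribute their own value to the result.
        total += sum(k for k in frontier if k < 2)
        # Every remaining call spawns its two children for the next level.
        frontier = [child for k in frontier if k >= 2
                    for child in (k - 1, k - 2)]
    return total


print(fib_by_levels(10), fib_recursive(10))
```

In this sketch the frontier can grow very quickly with the input, which hints at why the abstract pairs high vector utilization with a limit on space usage.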


“…If one thread in a warp makes a method call, all other threads will wait until the call returns before proceeding; as recursive calls can lead to long call chains, divergence can substantially decrease warp-level parallelism [8]. In contrast, autorope-enabled traversal algorithms do not suffer significant divergence: because the recursive method is translated into a loop over a stack, control immediately re-converges at the top of the loop, even as the threads diverge in the tree.…”

Section: Memory Coalescing and Thread Divergence (mentioning)

confidence: 99%
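The rewrite mentioned in the statement above, a recursive method turned into a loop over a stack, can be sketched as follows. This is a minimal illustration in plain Python rather than GPU code, and the `Node` class and both function names are hypothetical, not taken from the cited paper. It shows only the shape of the change: each recursive call becomes a push onto an explicit stack, so every iteration returns to the same loop head.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """Hypothetical binary tree node used only for this illustration."""
    value: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def sum_recursive(node):
    """Recursive traversal: each call may start a long call chain."""
    if node is None:
        return 0
    return node.value + sum_recursive(node.left) + sum_recursive(node.right)


def sum_with_stack(root):
    """Same traversal as a loop over an explicit stack."""
    total = 0
    stack = [root]
    while stack:  # single loop head that every iteration returns to
        node = stack.pop()
        if node is None:
            continue
        total += node.value
        # Push the children instead of calling recursively.
        stack.append(node.right)
        stack.append(node.left)
    return total


tree = Node(1, Node(2, Node(4)), Node(3))
print(sum_recursive(tree), sum_with_stack(tree))
```

On a GPU, threads running the loop version may visit different nodes, but they all execute the same loop body, which is the property the statement above attributes to autorope-enabled traversals.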

“…Then, when the warp's traversal returns to the tree node which the truncated point would have visited next, it is unmasked, and resumes its computation. Essentially, lockstep traversal forces autorope implementations to implement the same thread divergence behavior the GPU naturally provides for recursive implementations [8].…”

Section: Overview Of Lockstep Traversal (mentioning)

confidence: 99%

“…Méndez-Lojo et al present a GPU implementation of inclusion-based points-to analysis that performs graph rewrites in terms of matrix-matrix multiplication by leveraging clever encodings of a compressed sparse row representation [17]. Huo et al examined efficient scheduling of recursive control flow on GPUs, and present results which improve upon traditional post-dominator based reconvergence mechanisms designed to handle thread divergence due to control flow within a procedure [8].…”

Section: Related Work (mentioning)

confidence: 99%


General transformations for GPU execution of tree traversals

Goldfarb and Kulkarni, 2013

Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis


With the advent of programmer-friendly GPU computing environments, there has been much interest in offloading workloads that can exploit the high degree of parallelism available on modern GPUs. Exploiting this parallelism and optimizing for the GPU memory hierarchy is well-understood for regular applications that operate on dense data structures such as arrays and matrices. However, there has been significantly less work in the area of irregular algorithms and even less so when pointer-based dynamic data structures are involved. Recently, irregular algorithms such as Barnes-Hut and kd-tree traversals have been implemented on GPUs, yielding significant performance gains over CPU implementations. However, the implementations often rely on exploiting application-specific semantics to get acceptable performance. We argue that there are general-purpose techniques for implementing irregular algorithms on GPUs that exploit similarities in algorithmic structure rather than application-specific knowledge. We demonstrate these techniques on several tree traversal algorithms, achieving speedups of up to 38× over 32-thread CPU versions.

