Controlled Kernel Launch for Dynamic Parallelism in GPUs

Tang, Xulong; Pattnaik, Ashutosh; Jiang, Huaipan; Kayıran, Onur; Jog, Adwait; Pai, Sreepathi; Ibrahim, Mohamed; Kandemir, Mahmut; Das, Chita R.

doi:10.1109/hpca.2017.14

Cited by 36 publications

(18 citation statements)

References 33 publications


“…In this paper, it is not our intention to discuss, compare and/or quantify dynamic parallelism overhead with the counterpart approaches. Some of the works on the comparison and discussion of dynamic parallelism overhead are [59], [62], [63], [64].…”

Section: Dynamic Parallelism (mentioning)

confidence: 99%

SURAA: A Novel Method and Tool for Loadbalanced and Coalesced SpMV Computations on GPUs

et al. 2019


Sparse matrix-vector (SpMV) multiplication is a vital building block for numerous scientific and engineering applications. This paper proposes SURAA (translates to speed in Arabic), a novel method for SpMV computations on graphics processing units (GPUs). The novelty lies in the way we group matrix rows into different segments, and adaptively schedule various segments to different types of kernels. The sparse matrix data structure is created by sorting the rows of the matrix on the basis of the nonzero elements per row (npr) and forming segments of equal size (containing approximately an equal number of nonzero elements per row) using the Freedman–Diaconis rule. The segments are assembled into three groups based on the mean npr of the segments. For each group, we use multiple kernels to execute the group segments on different streams. Hence, the number of threads to execute each segment is adaptively chosen. Dynamic Parallelism available in Nvidia GPUs is utilized to execute the group containing segments with the largest mean npr, providing improved load balancing and coalesced memory access, and hence more efficient SpMV computations on GPUs. Therefore, SURAA minimizes the adverse effects of the npr variance by uniformly distributing the load using equal sized segments. We implement the SURAA method as a tool and compare its performance with the de facto best commercial (cuSPARSE) and open source (CUSP, MAGMA) tools using widely used benchmarks comprising 26 high-npr-variance matrices from 13 diverse domains. SURAA outperforms the other tools by delivering 13.99x speedup on average. We believe that our approach provides a fundamental shift in addressing SpMV-related challenges on GPUs including coalesced memory access, thread divergence, and load balancing, and is set to open new avenues for further improving SpMV performance in the future.
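As a rough illustration of the row-grouping idea in the abstract above, the following Python sketch sorts rows by npr and cuts segments using the Freedman–Diaconis bin width, h = 2 · IQR / n^(1/3). The abstract does not spell out how SURAA applies the rule, so the segmentation logic and the function names `freedman_diaconis_width` and `segment_rows_by_npr` are assumptions made for illustration only, not the authors' procedure.

```python
import statistics


def freedman_diaconis_width(values):
    """Freedman-Diaconis bin width: h = 2 * IQR / n^(1/3)."""
    n = len(values)
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return 2.0 * (q3 - q1) / (n ** (1.0 / 3.0))


def segment_rows_by_npr(npr):
    """Group row indices into segments of similar nonzeros-per-row (npr).

    Assumed scheme (not taken from the paper): sort rows by npr, then
    start a new segment whenever a row's npr leaves the current bin of
    width h.
    """
    if len(npr) < 2:
        return [list(range(len(npr)))]
    order = sorted(range(len(npr)), key=lambda r: npr[r])
    h = freedman_diaconis_width([npr[r] for r in order])
    if h <= 0:
        # All rows look alike to the rule; keep them in one segment.
        return [order]
    segments = []
    current = [order[0]]
    bin_start = npr[order[0]]
    for r in order[1:]:
        if npr[r] >= bin_start + h:
            segments.append(current)
            current, bin_start = [r], npr[r]
        else:
            current.append(r)
    segments.append(current)
    return segments
```

For example, `segment_rows_by_npr([1, 1, 2, 2, 3, 3, 40, 41])` separates the six sparse rows from the two dense ones, which is the kind of split that would let the dense segment be handed to a different kernel.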


“…They show that there is potential for speedup in several problems with inhomogeneous workload but that the greater overhead of launching kernels on the device can negate the benefits. Tang et al. [12] discuss a dynamic platform which seeks to launch device side kernels only when the potential computation time outweighs the launch overhead. They show good speedup for several benchmark problems.…”
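The trade-off described in this statement can be sketched as a simple cost comparison. The model below is a hypothetical illustration, not the mechanism from Tang et al.: the function name, its parameters, and the assumption that every work item costs the same time are all invented for the example.

```python
def should_launch_child(work_items, per_item_us, launch_overhead_us, child_threads):
    """Return True when a child kernel is estimated to beat serial work.

    Hypothetical cost model: the parent thread either processes all
    items itself, or pays a one-time launch overhead and lets
    `child_threads` threads process the items in parallel waves.
    """
    serial_us = work_items * per_item_us
    waves = -(-work_items // child_threads)  # ceiling division
    child_us = launch_overhead_us + waves * per_item_us
    return child_us < serial_us
```

Under this model, a small workload such as `should_launch_child(8, 1.0, 20.0, 32)` stays in the parent thread because the launch overhead dominates, while `should_launch_child(1024, 1.0, 20.0, 32)` favors a child kernel.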

Section: Dynamic Parallelism (mentioning)

confidence: 99%

Fast Distance Fields for Fluid Dynamics Mesh Generation on Graphics Hardware

Roosing

2019

CICP


We present a CUDA accelerated implementation of the Characteristic/Scan Conversion algorithm to generate narrow band signed distance fields in logically Cartesian grids. We outline an approach of task and data management on GPUs based on an input of a closed triangulated surface with the aim of reducing pre-processing and mesh-generation times. The work demonstrates a fast signed distance field generation of triangulated surfaces with tens of thousands to several million features in high resolution domains. We present improvements to the robustness of the original algorithm and an overview of handling geometric data.


“…Maestro [36] dynamically selects SMK versus spatial multitasking. A number of papers target dynamic parallelism (DP), in which a kernel launches child kernels to increase resource utilization, and reduce the launch overhead, exploit data locality and improve load balancing [9], [20], [41], [46], [47]. All of these prior works focus on resource partitioning and optimization within a conventional GPU; none of these prior works explore the opportunity for exploiting TLP-resource diversity.…”

Section: Related Work (mentioning)

confidence: 99%

HeteroCore GPU to Exploit TLP-Resource Diversity

Zhao, Wang, Eeckhout

2019

IEEE Trans. Parallel Distrib. Syst.


Graphics processing units (GPUs) are widely adopted as compute accelerators in cloud computing environments and supercomputers. Sharing GPU resources in such environments requires effective multitasking support. Unfortunately, conventional GPUs lack the ability to adapt to diverse thread-level parallelism (TLP) resource demands among co-executing kernels. Previous work such as SM partitioning and simultaneously multitasking (SMK) increase system throughput, however, they degrade per-application performance significantly. This paper proposes the HeteroCore GPU to significantly improve multitasking performance with a similar area cost as a conventional GPU. After rebalancing TLP-related SM resources, a HeteroCore GPU consists of two types of SMs to support diverse TLP-resource demands. Dynamic scheduling performs low-overhead spatial profiling during runtime across the different SM types and steers scheduling decisions based on the TLP-resource demands of the co-executing kernels. Compared to a conventional GPU, HeteroCore GPU improves system throughput by 20.1% on average (up to 80.9%) and per-application performance by 29.8% on average (up to 50.3%), for workload mixes composed of kernels with different TLP-resource demands.
