Static GPU Threads and an Improved Scan Algorithm

Breitbart, Jens

doi:10.1007/978-3-642-21878-1_46

Cited by 12 publications

(19 citation statements)

References 3 publications


“…Further improvements include exploiting producer-consumer locality via local shared memory [Satish et al 2009;Breitbart 2011;Yan et al 2013]. The persistent threads model can further be extended to support GPU-wide synchronization via message passing [Stuart and Owens 2009;Luo et al 2010;Xiao and Feng 2010] and dynamic priorities [Steinberger et al 2012].…”

Section: Related Work (mentioning)

confidence: 99%

Whippletree

et al. 2014


In this paper, we present Whippletree, a novel approach to scheduling dynamic, irregular workloads on the GPU. We introduce a new programming model which offers the simplicity and expressiveness of task-based parallelism while retaining all aspects of the multi-level execution hierarchy essential to unlocking the full potential of a modern GPU. At the same time, our programming model lends itself to efficient implementation on the SIMD-based architecture typical of a current GPU. We demonstrate the practical utility of our model by providing a reference implementation on top of current CUDA hardware. Furthermore, we show that our model compares favorably to traditional approaches in terms of both performance as well as the range of applications that can be covered. We demonstrate the benefits of our model for recursive Reyes rendering, procedural geometry generation and volume rendering with concurrent irradiance caching.



“…In this case recursive partitioning of the array is needed, and thus can incur further implementation complexity and performance overheads. This issue can be addressed by fixing the size of the intermediate array by using static threads [24].…”

Section: Inter-block Orchestration Mechanism (mentioning)

confidence: 99%

“…In this paper we refer it to Merrill_Scan. Jens Breitbart [24] proposed a scan algorithm on GPUs with a fixed number of threads. Nan Zhang [14] proposed a novel parallel scan for multicore processors.…”

Section: Related Work (mentioning)

confidence: 99%


StreamScan

Yan, Long, Zhang 2013

Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming


Scan (also known as prefix sum) is a very useful primitive for various important parallel algorithms, such as sort, BFS, SpMV, and compaction. The current state of the art in GPU-based scan implementation consists of three consecutive Reduce-Scan-Scan phases. This approach requires at least two global barriers and 3N (N is the problem size) global memory accesses. In this paper we propose StreamScan, a novel approach to implement scan on GPUs with only one computation phase. The main idea is to restrict synchronization to adjacent workgroups only, thereby eliminating global barrier synchronization completely. The new approach requires only 2N global memory accesses and just one kernel invocation. On top of this we propose two important optimizations to further boost performance, namely thread grouping to eliminate unnecessary local barriers, and register optimization to expand the on-chip problem size. We designed an auto-tuning framework to search the parameter space automatically to generate highly optimized code for both AMD and Nvidia GPUs. We implemented our technique with OpenCL. Compared with previous fast scan implementations, experimental results not only show promising performance speedups, but also reveal dramatically different optimization tradeoffs between Nvidia and AMD GPU platforms.
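The three-phase Reduce-Scan-Scan structure mentioned in the abstract can be illustrated with a small sequential sketch. This is not code from StreamScan or from any of the cited papers; the function name `reduce_scan_scan` and the `block_size` parameter are hypothetical, and each "block" here is just a Python list slice standing in for a GPU workgroup.

```python
from itertools import accumulate


def reduce_scan_scan(data, block_size=4):
    """Inclusive prefix sum in three phases (sequential illustration only)."""
    # Split the input into fixed-size blocks, one per (simulated) workgroup.
    blocks = [data[i:i + block_size] for i in range(0, len(data), block_size)]

    # Phase 1 (reduce): every block reads its elements and produces one sum.
    block_sums = [sum(block) for block in blocks]

    # Phase 2 (scan): exclusive scan over the per-block sums gives each
    # block the total of everything that precedes it.
    offsets = [0] + list(accumulate(block_sums))[:-1]

    # Phase 3 (scan): every block re-reads its elements, scans them locally,
    # adds its offset, and writes the result.
    out = []
    for block, offset in zip(blocks, offsets):
        out.extend(offset + partial for partial in accumulate(block))
    return out


if __name__ == "__main__":
    print(reduce_scan_scan([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

In this sketch the input is read once in phase 1 and once more in phase 3, and the output is written once, which is where the roughly 3N global memory accesses come from; on a GPU the boundaries between the phases correspond to the global barriers the abstract refers to.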

show abstract

“…Static WGs were first introduced by one of the authors in [3], and are a kind of container to which these tasks can be mapped. A static WG is assigned to a CU, where it runs the tasks in a user-defined sequential order without preemption.…”

Section: Static Work-groups (mentioning)

confidence: 99%

Analyzing Use of OpenCL on the Cell Broadband Engine and a Proposal for OpenCL Extensions

Breitbart, Fohry 2011

IJNC

Self Cite


Current processor architectures are diverse and heterogeneous. Examples include multicore chips, GPUs and the Cell Broadband Engine (CBE). The recent Open Compute Language (OpenCL) standard aims at efficiency and portability. This paper explores its efficiency when implemented on the CBE, without using CBE-specific features such as explicit asynchronous memory transfers. We based our experiments on two applications: matrix multiplication, and the client side of the Einstein@Home distributed computing project. Both were programmed in OpenCL, and then translated to the CBE. For matrix multiplication, we deployed different levels of OpenCL performance optimization, and observed that they pay off on the CBE. For Einstein@Home, our translated OpenCL version achieves almost the same speed as a native CBE version. We experimented with two versions of the OpenCL to CBE mapping, in which the PPE component of the CBE does or does not take the role of a compute unit. Another major contribution of the paper is a proposal for two OpenCL extensions that we analyzed for both CBE and NVIDIA GPUs. First, we suggest an additional memory level in OpenCL, called static local memory. With little programming expense, it can lead to significant speedups such as for reduction a factor of seven on the CBE and about 20% on NVIDIA GPUs. Second, we introduce static work-groups to support user-defined mappings of tasks. Static work-groups may simplify programming and lead to speedups of 35% (CBE) and 100% (GPU) for all-parallel-prefix-sums.

