2011
DOI: 10.1007/978-3-642-21878-1_46

Static GPU Threads and an Improved Scan Algorithm

Abstract: Current GPU programming systems automatically distribute the work on all GPU processors based on a set of fixed assumptions, e.g. that all tasks are independent of each other. We show that automatic distribution limits algorithmic design, and demonstrate that manual work distribution hardly adds any overhead. Our Scan+ algorithm is an improved scan relying on manual work distribution. It uses global barriers and task interleaving to provide almost twice the performance of Apple's reference implementation …

Cited by 12 publications (19 citation statements)
References 3 publications
“…Further improvements include exploiting producer-consumer locality via local shared memory [Satish et al 2009;Breitbart 2011;Yan et al 2013]. The persistent threads model can further be extended to support GPU-wide synchronization via message passing [Stuart and Owens 2009;Luo et al 2010;Xiao and Feng 2010] and dynamic priorities [Steinberger et al 2012].…”
Section: Related Work (mentioning)
confidence: 99%
“…In this case recursive partitioning of the array is needed, and thus can incur further implementation complexity and performance overheads. This issue can be addressed by fixing the size of the intermediate array by using static threads [24].…”
Section: Inter-block Orchestration Mechanism (mentioning)
confidence: 99%
“…In this paper we refer it to Merrill_Scan. Jens Breitbart [24] proposed a scan algorithm on GPUs with a fixed number of threads. Nan Zhang [14] proposed a novel parallel scan for multicore processors.…”
Section: Related Work (mentioning)
confidence: 99%
“…Static WGs were first introduced by one of the authors in [3], and are a kind of container to which these tasks can be mapped. A static WG is assigned to a CU, where it runs the tasks in a user-defined sequential order without preemption.…”
Section: Static Work-groups (mentioning)
confidence: 99%
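
Several of the statements above mention that a fixed number of static threads keeps the intermediate array of a scan at a fixed size. The following is a minimal sketch of that idea only, not the Scan+ algorithm from the paper: it simulates the workers sequentially on the CPU and leaves out global barriers and task interleaving. The function name `static_worker_scan` and the parameter `num_workers` are hypothetical and chosen for illustration.

```python
from itertools import accumulate


def static_worker_scan(data, num_workers=4):
    """Inclusive prefix sum with a fixed number of simulated static workers."""
    n = len(data)
    # Each worker owns one contiguous chunk. The chunk size grows with n,
    # but the number of workers stays fixed (assumption: num_workers >= 1).
    chunk = -(-n // num_workers)  # ceiling division
    bounds = [(min(w * chunk, n), min((w + 1) * chunk, n))
              for w in range(num_workers)]

    # Phase 1: every worker reduces its chunk to one partial sum.
    partial = [sum(data[lo:hi]) for lo, hi in bounds]

    # Phase 2: exclusive scan over the partial sums. This intermediate
    # array always has num_workers entries, whatever the input size,
    # so no recursive partitioning is needed.
    offsets = [0] + list(accumulate(partial))[:-1]

    # Phase 3: every worker scans its own chunk, starting from its offset.
    out = []
    for (lo, hi), offset in zip(bounds, offsets):
        running = offset
        for x in data[lo:hi]:
            running += x
            out.append(running)
    return out


if __name__ == "__main__":
    print(static_worker_scan(list(range(1, 11))))
```

On a GPU, the three phases would be separated by a device-wide synchronization point, and each simulated worker would correspond to a thread block or work-group that stays resident for the whole computation.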