Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2013
DOI: 10.1145/2442516.2442523

Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPU

Cited by 69 publications (29 citation statements)
References: 22 publications
“…In the GPU implementation of Andersen's analysis described in [22], a 32-word element is used, where the bits field spans 30 words (960 bits). This helps mitigate intra-warp divergence, since all 32 threads in a warp can perform operations (e.g., a coalesced global memory access [25] and a bitwise OR for ∪) on the 32 words in parallel. In a constraint graph G = (V, E), the variables, i.e., nodes in V are mapped to consecutive integers, starting from 0.…”
Section: Sparse Bit Vectors
confidence: 99%
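The excerpt above describes warp-wide operations on 32-word sparse bit-vector elements. The following is a minimal, hypothetical CUDA sketch of that access pattern, not the cited implementation: each of the 32 lanes of a warp handles one word of an element, so the warp's loads, the bitwise OR for set union, and the stores all touch consecutive addresses and coalesce. The names ELEMENT_WORDS and unionElements are illustrative only.

// Hedged sketch (not the cited implementation): each warp merges one
// 32-word element of src into the corresponding element of dst.
// Lane i touches word i, so a warp's 32 accesses fall on consecutive
// addresses and coalesce into a single memory transaction.
#include <cstdint>

#define ELEMENT_WORDS 32   // one element = 32 consecutive 32-bit words

__global__ void unionElements(uint32_t *dst, const uint32_t *src, int numElements)
{
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane   = threadIdx.x % 32;
    if (warpId >= numElements) return;

    // Consecutive lanes read consecutive words: a coalesced access.
    uint32_t s = src[warpId * ELEMENT_WORDS + lane];
    uint32_t d = dst[warpId * ELEMENT_WORDS + lane];

    // Bitwise OR implements set union on the bits field.
    dst[warpId * ELEMENT_WORDS + lane] = d | s;
}

Per the excerpt, only 30 of the 32 words in the cited design carry the bits field (the rest hold element metadata); the sketch ignores that detail for brevity.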
“…As Table 1 shows, these programs come from four benchmark suites, cover a broad set of domains, and include a similar number of regular and irregular programs. Those irregular benchmarks impose special challenges for GPGPU optimization, and have drawn a lot of attention from the community recently [6,24,29,30,43,45].…”
Section: Benchmarks
confidence: 99%
“…Many studies have been proposed to improve GPU memory performance, including data placement in memory [8] or on chip [23], streamlining irregular memory accesses or control flows at runtime [43,45,46], bypassing L1 cache [17] and so on. The precise control of task-SM affinity enabled by SM-centric transformation opens new opportunities, as illustrated by the affinity-based scheduling.…”
Section: Related Work
confidence: 99%
“…Kim and Han [14] design an algorithm to replace unnecessary gather and scatter operations by scalar operations. Wu et al [28] try to resolve a very similar problem, coalesced memory access, within the context of the GPU architecture. Focusing on inter-iteration parallelism on an irregular reduction for a SSE-like instruction set, we address this problem by a novel computation (edges data) reordering method, which we describe below.…”
Section: Data Reorganization
confidence: 99%
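The works cited in this excerpt address coalescing by reorganizing data or computation so that a warp's accesses fall on consecutive addresses. Below is a hedged CUDA sketch of the general idea, not of any cited algorithm: the first kernel gathers through an index array, which can produce non-coalesced accesses; the second assumes the data has already been remapped so that thread i's operand sits at position i. All names (gatherKernel, reorderedKernel, remapped) are illustrative.

// Hedged sketch of the effect of data reorganization on coalescing.
__global__ void gatherKernel(float *out, const float *data,
                             const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = data[idx[i]];        // indirect access: may not coalesce
}

__global__ void reorderedKernel(float *out, const float *remapped, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = remapped[i];         // consecutive threads, consecutive words
}

Producing the remapped layout is itself the costly step; the complexity analysis in the paper indexed on this page concerns how hard that reorganization problem is and how to design algorithms for it.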
“…This includes work on parallelizing stencil applications on GPUs [5,18,19,10,4]. For irregular applications on GPU, the coalesced memory access problem has also been addressed [28,29]. However, because of the differences in the architectures (e.g.…”
Section: Related Work
confidence: 99%