Proceedings of the 18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming 2013
DOI: 10.1145/2442516.2442523

Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPU

Cited by 69 publications (29 citation statements)
References: 22 publications
“…In the GPU implementation of Andersen's analysis described in [22], a 32-word element is used, where the bits field spans 30 words (960 bits). This helps mitigate intra-warp divergence, since all 32 threads in a warp can perform operations (e.g., a coalesced global memory access [25] and a bitwise OR for ∪) on the 32 words in parallel. In a constraint graph G = (V, E), the variables, i.e., nodes in V are mapped to consecutive integers, starting from 0.…”
Section: Sparse Bit Vectors
confidence: 99%
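The excerpt above describes warp-wide operations on 32-word sparse bit-vector elements. The following is a minimal, hypothetical CUDA sketch of that access pattern, not the cited implementation: each of the 32 lanes of a warp handles one word of an element, so the warp's loads, the bitwise OR for set union, and the stores all touch consecutive addresses and coalesce. The names ELEMENT_WORDS and unionElements are illustrative only.

// Hedged sketch (not the cited implementation): each warp merges one
// 32-word element of src into the corresponding element of dst.
// Lane i touches word i, so a warp's 32 accesses fall on consecutive
// addresses and coalesce into a single memory transaction.
#include <cstdint>

#define ELEMENT_WORDS 32   // one element = 32 consecutive 32-bit words

__global__ void unionElements(uint32_t *dst, const uint32_t *src, int numElements)
{
    int warpId = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane   = threadIdx.x % 32;
    if (warpId >= numElements) return;

    // Consecutive lanes read consecutive words: a coalesced access.
    uint32_t s = src[warpId * ELEMENT_WORDS + lane];
    uint32_t d = dst[warpId * ELEMENT_WORDS + lane];

    // Bitwise OR implements set union on the bits field.
    dst[warpId * ELEMENT_WORDS + lane] = d | s;
}

Per the excerpt, only 30 of the 32 words in the cited design carry the bits field (the rest hold element metadata); the sketch ignores that detail for brevity.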
“…As Table 1 shows, these programs come from four benchmark suites, cover a broad set of domains, and include a similar number of regular and irregular programs. Those irregular benchmarks impose special challenges for GPGPU optimization, and have drawn a lot of attention from the community recently [6,24,29,30,43,45].…”
Section: Benchmarks
confidence: 99%
“…Many studies have been proposed to improve GPU memory performance, including data placement in memory [8] or on chip [23], streamlining irregular memory accesses or control flows at runtime [43,45,46], bypassing L1 cache [17] and so on. The precise control of task-SM affinity enabled by SM-centric transformation opens new opportunities, as illustrated by the affinity-based scheduling.…”
Section: Related Work
confidence: 99%
“…Kim and Han [14] design an algorithm to replace unnecessary gather and scatter operations by scalar operations. Wu et al [28] try to resolve a very similar problem, coalesced memory access, within the context of the GPU architecture. Focusing on inter-iteration parallelism on an irregular reduction for a SSE-like instruction set, we address this problem by a novel computation (edges data) reordering method, which we describe below.…”
Section: Data Reorganization
confidence: 99%
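The works cited in this excerpt address coalescing by reorganizing data or computation so that a warp's accesses fall on consecutive addresses. Below is a hedged CUDA sketch of the general idea, not of any cited algorithm: the first kernel gathers through an index array, which can produce non-coalesced accesses; the second assumes the data has already been remapped so that thread i's operand sits at position i. All names (gatherKernel, reorderedKernel, remapped) are illustrative.

// Hedged sketch of the effect of data reorganization on coalescing.
__global__ void gatherKernel(float *out, const float *data,
                             const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = data[idx[i]];        // indirect access: may not coalesce
}

__global__ void reorderedKernel(float *out, const float *remapped, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = remapped[i];         // consecutive threads, consecutive words
}

Producing the remapped layout is itself the costly step; the complexity analysis in the paper indexed on this page concerns how hard that reorganization problem is and how to design algorithms for it.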
“…This includes work on parallelizing stencil applications on GPUs [5,18,19,10,4]. For irregular applications on GPU, the coalesced memory access problem has also been addressed [28,29]. However, because of the differences in the architectures (e.g.…”
Section: Related Work
confidence: 99%