2013
DOI: 10.1145/2508148.2485954

SIMD divergence optimization through intra-warp compaction

Abstract: SIMD execution units in GPUs are increasingly used for high performance and energy efficient acceleration of general purpose applications. However, SIMD control flow divergence effects can result in reduced execution efficiency in a class of GPGPU applications, classified as divergent applications. Improving SIMD efficiency, therefore, has the potential to bring significant performance and energy benefits to a wide range of such data parallel applications. Recently, the SIMD divergence problem has re…

Cited by 12 publications (16 citation statements) · References 16 publications
“…Later, Aamodt [34] improves the dual-path design to a multi-path micro-architecture that decouples divergence and reconvergence tracking. There are also proposals that optimize SIMD divergence through compaction [35][36]. However, these hardware-based techniques require hardware changes and inevitably add complexity and hardware cost in the register file, scheduling logic, etc.…”
Section: Related Work
confidence: 99%
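The compaction idea referenced in this statement can be illustrated with a minimal Python sketch; the widths and packing policy below are illustrative assumptions, not the actual mechanism of [35] or [36]: the active lanes of a partially diverged warp are gathered into dense SIMD issue groups so that fewer lane slots go idle.

# Illustrative sketch only: pack the active lanes of a diverged warp into
# dense SIMD groups so fewer issue slots are wasted (widths are assumed).

SIMD_WIDTH = 8  # assumed width of one issue group

def compact_lanes(active_mask):
    """Group the indices of active lanes into dense SIMD_WIDTH-sized issue
    groups, i.e. an idealized form of intra-warp compaction."""
    active = [lane for lane, on in enumerate(active_mask) if on]
    return [active[i:i + SIMD_WIDTH] for i in range(0, len(active), SIMD_WIDTH)]

# Example: a 32-lane warp in which only every fourth lane took the branch.
mask = [lane % 4 == 0 for lane in range(32)]
print(compact_lanes(mask))  # one dense group of 8 lanes instead of 4 sparse groups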
“…TSIMT takes a different approach that avoids this problem almost entirely at the expense of higher issue throughput requirements. [Vaidya et al. 2013] proposed an architecture in which 16-wide SIMD instructions are executed over multiple cycles on 4-wide SIMD units. Two techniques are proposed to accelerate execution when only a subset of threads is active: Basic Cycle Compression (BCC), where SIMD subwords are skipped if none of their threads is active, and a more costly but more powerful technique called Swizzled Cycle Compression (SCC), which employs crossbars to permute the operands prior to compaction, enabling more effective compaction.…”
Section: Related Work
confidence: 99%
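A minimal Python model of the two techniques named above; the 16-wide instruction and 4-wide execution unit come from the quoted description, while the mask handling and cycle counting are assumptions for illustration.

SUBWORD = 4   # width of the SIMD execution unit (from the description above)
WARP = 16     # logical width of one SIMD instruction

def cycles_bcc(active_mask):
    """Basic Cycle Compression: issue a cycle for a 4-wide subword only if at
    least one of its lanes is active; fully inactive subwords are skipped."""
    subwords = [active_mask[i:i + SUBWORD] for i in range(0, WARP, SUBWORD)]
    return sum(1 for sw in subwords if any(sw))

def cycles_scc(active_mask):
    """Swizzled Cycle Compression (idealized): a crossbar permutes operands so
    active lanes are packed together, so only ceil(active / SUBWORD) cycles
    are needed."""
    active = sum(active_mask)
    return -(-active // SUBWORD)  # ceiling division

# Example: only the even lanes are active. BCC cannot skip any subword, but
# SCC packs the 8 active lanes into 2 cycles instead of 4.
mask = [lane % 2 == 0 for lane in range(WARP)]
print(cycles_bcc(mask), cycles_scc(mask))  # prints: 4 2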
“…This is attributed to the divergence optimization called "Basic Cycle Compression" for the Intel GPU [13]. In this approach, if 4 contiguous work-items within a warp (of either 16 or 8 work-items) are not divergent, the inactive cycles caused by divergence within the warp can be compressed.…”
Section: A. Effectiveness of Divergence Mitigation
confidence: 99%
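A small, assumed illustration of the skip condition described in this statement: for a 16-wide (or 8-wide) warp executed in groups of 4 contiguous work-items, a group that is entirely inactive contributes no execution cycle.

def compressed_cycles(active_mask, group=4):
    """Count the 4-wide execution cycles that Basic Cycle Compression can skip
    because all work-items in the group are inactive."""
    groups = [active_mask[i:i + group] for i in range(0, len(active_mask), group)]
    return sum(1 for g in groups if not any(g))

# A 16-wide warp whose upper half diverged away: 2 of the 4 cycles are skipped.
mask = [True] * 8 + [False] * 8
print(compressed_cycles(mask))  # prints: 2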