2013
DOI: 10.1145/2508148.2485954

SIMD divergence optimization through intra-warp compaction

Abstract: SIMD execution units in GPUs are increasingly used for high performance and energy efficient acceleration of general purpose applications. However, SIMD control flow divergence effects can result in reduced execution efficiency in a class of GPGPU applications, classified as divergent applications. Improving SIMD efficiency, therefore, has the potential to bring significant performance and energy benefits to a wide range of such data parallel applications. Recently, the SIMD divergence problem has re…

Cited by 12 publications (16 citation statements) · References 16 publications
“…Later, Aamodt [34] improves the dual-path design to a multi-path micro-architecture that decouples divergence and reconvergence tracking. There are also proposals that optimize SIMD divergence through compaction [35][36]. However, these hardware-based techniques require hardware changes and inevitably add complexity and hardware cost in the register file, scheduling logic, etc.…”
Section: Related Work
confidence: 99%
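The compaction idea referenced in this statement can be illustrated with a minimal Python sketch; the widths and packing policy below are illustrative assumptions, not the actual mechanism of [35] or [36]: the active lanes of a partially diverged warp are gathered into dense SIMD issue groups so that fewer lane slots go idle.

# Illustrative sketch only: pack the active lanes of a diverged warp into
# dense SIMD groups so fewer issue slots are wasted (widths are assumed).

SIMD_WIDTH = 8  # assumed width of one issue group

def compact_lanes(active_mask):
    """Group the indices of active lanes into dense SIMD_WIDTH-sized issue
    groups, i.e. an idealized form of intra-warp compaction."""
    active = [lane for lane, on in enumerate(active_mask) if on]
    return [active[i:i + SIMD_WIDTH] for i in range(0, len(active), SIMD_WIDTH)]

# Example: a 32-lane warp in which only every fourth lane took the branch.
mask = [lane % 4 == 0 for lane in range(32)]
print(compact_lanes(mask))  # one dense group of 8 lanes instead of 4 sparse groups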
“…TSIMT takes a different approach that avoids this problem almost entirely at the expense of higher issue throughput requirements. [Vaidya et al. 2013] proposed an architecture in which 16-wide SIMD instructions are executed over multiple cycles on 4-wide SIMD units. Two techniques are proposed to accelerate execution when only a subset of threads is active: Basic Cycle Compression (BCC), where SIMD subwords are skipped if none of their threads is active, and a more costly but more powerful technique called Swizzled Cycle Compression (SCC), which employs crossbars to permute the operands prior to compaction, enabling more effective compaction.…”
Section: Related Work
confidence: 99%
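A minimal Python model of the two techniques named above; the 16-wide instruction and 4-wide execution unit come from the quoted description, while the mask handling and cycle counting are assumptions for illustration.

SUBWORD = 4   # width of the SIMD execution unit (from the description above)
WARP = 16     # logical width of one SIMD instruction

def cycles_bcc(active_mask):
    """Basic Cycle Compression: issue a cycle for a 4-wide subword only if at
    least one of its lanes is active; fully inactive subwords are skipped."""
    subwords = [active_mask[i:i + SUBWORD] for i in range(0, WARP, SUBWORD)]
    return sum(1 for sw in subwords if any(sw))

def cycles_scc(active_mask):
    """Swizzled Cycle Compression (idealized): a crossbar permutes operands so
    active lanes are packed together, so only ceil(active / SUBWORD) cycles
    are needed."""
    active = sum(active_mask)
    return -(-active // SUBWORD)  # ceiling division

# Example: only the even lanes are active. BCC cannot skip any subword, but
# SCC packs the 8 active lanes into 2 cycles instead of 4.
mask = [lane % 2 == 0 for lane in range(WARP)]
print(cycles_bcc(mask), cycles_scc(mask))  # prints: 4 2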
“…This is attributed to the divergence optimization called "Basic Cycle Compression" for the Intel GPU [13]. In this approach, if 4 contiguous work-items within a warp (of either 16 or 8 work-items) are not divergent, the inactive cycles caused by divergence within the warp can be compressed.…”
Section: A. Effectiveness of Divergence Mitigation
confidence: 99%
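A small, assumed illustration of the skip condition described in this statement: for a 16-wide (or 8-wide) warp executed in groups of 4 contiguous work-items, a group that is entirely inactive contributes no execution cycle.

def compressed_cycles(active_mask, group=4):
    """Count the 4-wide execution cycles that Basic Cycle Compression can skip
    because all work-items in the group are inactive."""
    groups = [active_mask[i:i + group] for i in range(0, len(active_mask), group)]
    return sum(1 for g in groups if not any(g))

# A 16-wide warp whose upper half diverged away: 2 of the 4 cycles are skipped.
mask = [True] * 8 + [False] * 8
print(compressed_cycles(mask))  # prints: 2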