Proceedings of the 48th International Symposium on Microarchitecture 2015
DOI: 10.1145/2830772.2830796

Efficient warp execution in presence of divergence with collaborative context collection

Abstract: GPU's SIMD architecture is a double-edged sword when confronting parallel tasks with control-flow divergence. On the one hand, it provides a high-performance yet power-efficient platform to accelerate applications via massive parallelism; on the other hand, irregularities induce inefficiencies due to the warp's lockstep traversal of all diverging execution paths. In this work, we present a software (compiler) technique named Collaborative Context Collection (CCC) that increases the warp execution efficiency…
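For concreteness, the divergence pattern the abstract refers to looks like the sketch below (an illustrative kernel, not taken from the paper): a data-dependent branch splits a warp's lanes, and the SIMD hardware runs both paths back to back with complementary lanes masked off.

```cuda
// Illustrative CUDA kernel (not from the paper): the data-dependent branch
// splits each warp, and the hardware serializes the two paths with
// complementary lane masks, wasting issue slots on irregular inputs.
__global__ void divergent_kernel(const int *in, int *out, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    if (in[tid] % 2 == 0)                // lanes disagree on this predicate
        out[tid] = in[tid] * in[tid];    // path A: some lanes masked off
    else
        out[tid] = in[tid] + 1;          // path B: the other lanes masked off
}
```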

Cited by 30 publications (14 citation statements) | References 53 publications
“…CCC [18] tried to increase warp execution efficiency when each thread processes repetitive tasks with divergent paths, such as a loop containing if-else statements in its body. The main idea is to gather enough iterations that take the same direction of the if-else statement and execute them at once.…”
Section: Related Work
confidence: 99%
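A hedged sketch of that collection idea follows; it is not the authors' generated code, and the names (Context, run_task, ccc_like) are illustrative. Each lane defers a divergent task by pushing its context onto a per-warp stack in shared memory, and the warp drains the stack only once a full warp's worth of contexts is available, so the task body runs with all 32 lanes active. The sketch assumes blockDim.x == 64 and n a multiple of the total thread count so every warp stays converged at the ballot.

```cuda
struct Context { int item; };                      // registers live across the branch

__device__ void run_task(const Context &c, int *out) { out[c.item] = 2 * c.item; }

__global__ void ccc_like(const int *work, int *out, int n)
{
    __shared__ Context stack[2][64];               // one stack per warp, 2 warps/block
    int lane = threadIdx.x % 32;
    int warp = threadIdx.x / 32;
    int top  = 0;                                  // stack depth, uniform across the warp

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        bool taken = (work[i] % 2 == 0);           // the divergent predicate
        unsigned mask = __ballot_sync(0xffffffffu, taken);
        if (taken) {                               // slot = rank among taking lanes
            int slot = top + __popc(mask & ((1u << lane) - 1u));
            stack[warp][slot] = Context{ work[i] };
        }
        __syncwarp();                              // make the pushes visible warp-wide
        top += __popc(mask);
        while (top >= 32) {                        // drain with full lane utilization
            run_task(stack[warp][top - 32 + lane], out);
            top -= 32;
        }
        __syncwarp();                              // drain done before next overwrite
    }
    if (lane < top)                                // leftovers run partially occupied
        run_task(stack[warp][lane], out);
}
```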
“…Each of these two warps has partially active lanes, and the warps have to be executed one after another. On completion of the execution of both paths, the warps rejoin to continue normal execution as a single warp [52,53].…”
Section: Partial-lane
confidence: 99%
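The partial-lane behaviour is directly observable: __popc of __activemask() counts the lanes currently executing, so each side of a branch reports a partial warp while code after the rejoin typically reports all 32 lanes again. A small observational sketch (not from the cited works), launched as show_masks<<<1, 32>>>():

```cuda
#include <cstdio>

__global__ void show_masks()
{
    int lane = threadIdx.x % 32;
    if (lane < 12) {
        unsigned m = __activemask();   // lanes executing the "then" path
        if (lane == 0)  printf("then path   : %2d active lanes\n", __popc(m)); // 12
    } else {
        unsigned m = __activemask();   // lanes executing the "else" path
        if (lane == 12) printf("else path   : %2d active lanes\n", __popc(m)); // 20
    }
    __syncwarp();                      // explicit reconvergence point (Volta+)
    unsigned m = __activemask();
    if (lane == 0)      printf("after rejoin: %2d active lanes\n", __popc(m)); // 32
}
```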
“…For applications that seldom use the shared memory, the shared memory can also be used to store temporary context information. For example, to compact divergent threads, the relevant registers of divergent threads can be collected in a warp-specific stack allocated in the shared memory and restored only when perfect utilization of the warp lanes becomes feasible [15]. To maximize thread parallelism by assigning threads up to the register-file limit instead of the scheduling limit [37], the context information of thread blocks that are currently not considered for scheduling can be stored temporarily in the shared memory.…”
Section: Using Unused Shared Memory to Store Context Information
confidence: 99%
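A minimal sketch of that layout idea, under stated assumptions: the kernel does not otherwise use shared memory, so dynamic shared memory is carved into one context stack per warp. ThreadCtx, warp_stack_kernel, and depth are hypothetical names, not from [15] or [37]; depth is assumed to be at least 32.

```cuda
struct ThreadCtx { float acc; int idx; };          // the live registers to preserve

__global__ void warp_stack_kernel(int depth)       // depth = contexts per warp (>= 32)
{
    extern __shared__ unsigned char smem[];
    int warp = threadIdx.x / 32;
    int lane = threadIdx.x % 32;

    // Each warp owns a disjoint slice of the dynamic shared memory.
    ThreadCtx *my_stack = reinterpret_cast<ThreadCtx *>(smem) + warp * depth;

    // Spill this thread's live registers into its warp's slice...
    my_stack[lane] = ThreadCtx{ 0.0f, lane };
    __syncwarp();

    // ...and reload them later, once full lanes of useful work are available.
    ThreadCtx restored = my_stack[lane];
    (void)restored;                                // placeholder for real use
}

// Host side: size the per-warp stacks at launch (illustrative):
//   int warps = block / 32;
//   warp_stack_kernel<<<grid, block, warps * depth * sizeof(ThreadCtx)>>>(depth);
```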