Improving branch divergence performance on GPGPU with a new PDOM stack and multi-level warp scheduling

Yu, Licheng; Tang, Xingsheng; Wu, Minghui; Chen, Tianzhou

doi:10.1016/j.sysarc.2013.11.008

Cited by 1 publication

(3 citation statements)

References 31 publications

“…BinarySearch even suffers performance degradations in the 2-wide configurations due to an increase in the IF stalls. 13 For the kernels of breadth-first search, the largest configurations only achieve 4.46% speed-up for BFS_1 and 7.95% for BFS_2 and this is only ∼1% more than the 2-wide configurations with single FUs.…”

Section: A ILP (mentioning)

confidence: 95%

“…This also allows threads to issue memory operations with high spatial locality resulting in data traffic optimization in the memory hierarchy. These constraints have little effect on highly-regular graphic shader programs, but throughput can dramatically decrease in the presence of control-flow with bespoke solutions proposed to alleviate thread divergence [12] [13]. System designers have looked into building systems with many cores that are not multi-threaded [14][15], but this approach still does not address the fact that not all problems can be solved effectively in the same manner.…”

Section: A Motivation (mentioning)

confidence: 99%

“…In the 4-wide configurations, the IF stalls decrease again enabling these configurations to perform better than the 2-wide machines. The increase in ALUs and MULs improves performance by 13 These are stalls due to the instruction Front-end of the processor not producing a full LIW for execution by the LE1 back-end pipeline. These stalls are documented in our previous work [31] and are mostly eliminated when choosing a decoupled instruction front-end for the LE1, as this is a valid configuration option in a second generation micro-architecture.…”

Section: A ILP (mentioning)

confidence: 99%
