Thread block compaction for efficient SIMT control flow

Fung, Wilson Wai Lun; Aamodt, Tor M.

doi:10.1109/hpca.2011.5749714

Cited by 137 publications

(138 citation statements)

References 16 publications

Citation types: Supporting, Mentioning (134), Contrasting, Unclassified


“…In this paper we address a significant issue with previously proposed compaction mechanisms [8,19] that hinders their effectiveness. In order to identify candidates for compaction, hardware stalls all warps within a CTA on any potentially divergent branch until all warps reach the branch point.…”

Section: Introduction (mentioning)

confidence: 99%

“…This is because only a single instruction is issued to all SIMD lanes, implying that only a subset of the lanes should actually execute operations and commit results. Recent research has shown that the impact of this control divergence problem can be reduced by dynamically forming SIMD-instructions from large collections of threads [8,19]. These collections of threads are called cooperating thread arrays (CTAs) or thread blocks by NVIDIA's CUDA [21] and workgroups by OpenCL [3].…”

Section: Introduction (mentioning)

confidence: 99%
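The compaction idea described in the statement above can be illustrated with a minimal Python sketch. It assumes, for illustration only, that a thread must stay in its original SIMD lane when warps are re-formed; the names `compact_block` and `simd_utilization` and the 4-wide warps are hypothetical and are not taken from the cited papers.

```python
from typing import List


def compact_block(masks: List[List[bool]]) -> List[List[int]]:
    """Repack the active threads of one thread block into fewer warps.

    masks[w][lane] is True when the thread in warp w, lane `lane` is active
    on the current control-flow path. Each returned warp has one slot per
    lane holding the source warp id, or -1 for an idle lane.
    Assumption (not from the source): threads keep their original lane.
    """
    width = len(masks[0]) if masks else 0
    # For every lane, list the source warps that have an active thread there.
    lanes = [[w for w, m in enumerate(masks) if m[lane]] for lane in range(width)]
    # The busiest lane determines how many compacted warps are needed.
    n_out = max((len(q) for q in lanes), default=0)
    return [[q[i] if i < len(q) else -1 for q in lanes] for i in range(n_out)]


def simd_utilization(warps_issued: int, active_threads: int, width: int) -> float:
    """Fraction of issued SIMD lane slots that do useful work."""
    return active_threads / (warps_issued * width) if warps_issued else 0.0


# Four 4-wide warps with 6 active threads on one side of a divergent branch.
masks = [
    [True, False, False, True],
    [False, True, False, False],
    [True, False, True, False],
    [False, False, False, True],
]
compacted = compact_block(masks)
active = sum(sum(m) for m in masks)
before = simd_utilization(len(masks), active, 4)      # masked execution only
after = simd_utilization(len(compacted), active, 4)   # after compaction
```

In this example the six active threads fit into two compacted warps instead of four masked ones, so fewer issue slots are wasted. The gain depends on how the active threads are distributed across lanes under the lane-preserving assumption.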


CAPRI: Prediction of compaction-adequacy for handling control-divergence in GPGPU architectures

Rhu, Erez

2012

2012 39th Annual International Symposium on Computer Architecture (ISCA)


Wide SIMD-based GPUs have evolved into a promising platform for running general purpose workloads. Current programmable GPUs allow even code with irregular control to execute well on their SIMD pipelines. To do this, each SIMD lane is considered to execute a logical thread where hardware ensures that control flow is accurate by automatically applying masked execution. The masked execution, however, often degrades performance because the issue slots of masked lanes are wasted. This degradation can be mitigated by dynamically compacting multiple unmasked threads into a single SIMD unit. This paper proposes a fundamentally new approach to branch compaction that avoids the unnecessary synchronization required by previous techniques and that only stalls threads that are likely to benefit from compaction. Our technique is based on the compaction-adequacy predictor (CAPRI). CAPRI dynamically identifies the compaction-effectiveness of a branch and only stalls threads that are predicted to benefit from compaction. We utilize a simple single-level branch-predictor inspired structure and show that this simple configuration attains a prediction accuracy of 99.8% and 86.6% for non-divergent and divergent workloads, respectively. Our performance evaluation demonstrates that CAPRI consistently outperforms both the baseline design that never attempts compaction and prior work that stalls upon all divergent branches.
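The prediction idea in this abstract can be sketched roughly as follows. This is a hypothetical illustration, not the CAPRI design: the abstract does not give the table organization, the default prediction, or the update rule, so the one-bit entry per branch PC, the stall-by-default choice, and the "compaction reduced the warp count" criterion are all assumptions, as is the class name `AdequacyPredictor`.

```python
from typing import Dict


class AdequacyPredictor:
    """Hypothetical 1-bit, PC-indexed compaction-adequacy table (assumed design)."""

    def __init__(self, default_stall: bool = True) -> None:
        # Assumption: branches never seen before use this default prediction.
        self.default_stall = default_stall
        self.table: Dict[int, bool] = {}

    def should_stall(self, branch_pc: int) -> bool:
        """Predict whether stalling warps at this branch to compact is worthwhile."""
        return self.table.get(branch_pc, self.default_stall)

    def update(self, branch_pc: int, warps_without: int, warps_with: int) -> None:
        """Record the last outcome: compaction was adequate if it saved warps."""
        self.table[branch_pc] = warps_with < warps_without


predictor = AdequacyPredictor()
first = predictor.should_stall(0x40)    # unseen branch: falls back to the default
predictor.update(0x40, warps_without=4, warps_with=4)   # compaction saved nothing
second = predictor.should_stall(0x40)   # the same branch is no longer stalled
predictor.update(0x40, warps_without=4, warps_with=2)   # compaction halved the warps
third = predictor.should_stall(0x40)
```

The point of the sketch is the control decision only: warps wait at a branch when the table predicts a benefit, and proceed without synchronization otherwise.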




“…A similar mechanism was proposed in 2011 for SIMT architectures (W. Fung, Aamodt, 2011). In that proposal, the divergence-management logic is implemented in hardware rather than in software, and relies on masks rather than on multiple PCs.…”

Section: C* Pour Hypercube (unclassified)

Reconvergence de contrôle implicite pour les architectures SIMT

Brunie, Collange

2013

Techniques et sciences informatiques


Parallel architectures following the SIMT model such as GPUs benefit from application regularity by issuing concurrent threads running in lockstep on SIMD units. As threads take different paths across the control-flow graph, lockstep execution is partially lost, and must be regained whenever possible in order to maximize the occupancy of SIMD units. In this paper, we propose two techniques to handle SIMT control divergence and identify reconvergence points. The most advanced one operates in constant space and handles indirect jumps and recursion. We evaluate a hardware implementation which leverages the existing memory divergence management unit. In terms of performance, this solution is at least as efficient as state-of-the-art techniques in use in current GPUs.

Keywords: control-flow reconvergence, SIMD, SIMT, GPU


“…Different solutions have been proposed at the software [41] and hardware levels [12][13] [35]. Here, we propose to use the scalar unit to eliminate control divergence at run time.…”

Section: Collaborative Execution Paradigm II: Control Divergence (mentioning)

confidence: 99%

A Case for a Flexible Scalar Unit in SIMT Architecture

Yang, Xiang, Mantor, et al.

2014

2014 IEEE 28th International Parallel and Distributed Processing Symposium


The wide availability and the Single-Instruction Multiple-Thread (SIMT)-style programming model have made graphics processing units (GPUs) a promising choice for high performance computing. However, because of the SIMT-style processing, an instruction will be executed in every thread even if the operands are identical for all the threads. To overcome this inefficiency, AMD's latest Graphics Core Next (GCN) architecture integrates a scalar unit into a SIMT unit. In GCN, both the SIMT unit and the scalar unit share a single SIMT-style instruction stream. Depending on its type, an instruction is issued to either a scalar or a SIMT unit. In this paper, we propose to extend the scalar unit so that it can either share the instruction stream with the SIMT unit or execute a separate instruction stream. The program to be executed by the scalar unit is referred to as a scalar program and its purpose is to assist SIMT-unit execution. The scalar programs are either generated from SIMT programs automatically by the compiler or manually developed by expert developers. We make a case for our proposed flexible scalar unit through three collaborative execution paradigms: data prefetching, control divergence elimination, and scalar-workload extraction. Our experimental results show that significant performance gains can be achieved using our proposed approaches compared to state-of-the-art SIMT-style processing.
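The inefficiency this abstract starts from, redundant work when all threads hold identical operands, can be illustrated with a small Python sketch. The abstract says the issue decision depends on the instruction type; the runtime operand comparison below is only an illustration of the redundancy, and the helper names `is_uniform` and `executed_operations` are hypothetical.

```python
from typing import List, Sequence, Tuple

Operands = Sequence[Tuple[int, ...]]  # one operand tuple per thread


def is_uniform(operands: Operands) -> bool:
    """True when every thread supplies exactly the same operand tuple."""
    return all(op == operands[0] for op in operands)


def executed_operations(instrs: List[Operands]) -> Tuple[int, int]:
    """Return (SIMT-only count, count when uniform work runs once on a scalar unit)."""
    simt_only = sum(len(ops) for ops in instrs)
    with_scalar = sum(1 if is_uniform(ops) else len(ops) for ops in instrs)
    return simt_only, with_scalar


# Two instructions over four threads: one with identical operands in every
# thread, one whose first operand differs per thread.
uniform_instr = [(2, 3)] * 4
varying_instr = [(0, 1), (1, 1), (2, 1), (3, 1)]
simt_only, with_scalar = executed_operations([uniform_instr, varying_instr])
```

Here the uniform instruction is counted once instead of once per thread, which is the kind of saving a scalar unit targets.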
