Divergence Analysis and Optimizations

Coutinho, Bruno; Sampaio, Diogo; Pereira, Fernando Magno Quintão; Meira, Wagner

doi:10.1109/pact.2011.63

Cited by 78 publications

(73 citation statements)

References 31 publications

“…[15] N Y N Y Boyer et. al [16] Y N N -Coutinho et al [17] N Y N Y Lungu et. al [18] N Y Y N Thomas and Daruwala [19] Y N N Y…”

Section: Related Work (mentioning)

confidence: 99%

A Systematic Method for Detecting Parallelized Software Bottlenecks and Suggesting Modifications: The Case of the Expectation Maximization Algorithm

Oliveira, Chella, Macedo et al. 2017

JSW


Parallelized algorithms can distribute the workload across the available multi-core processors. Graphical Processing Units (GPUs) began to be used in general-purpose computing thanks to their ability to perform thousands of operations simultaneously in their parallel coprocessors. Unfortunately, providing parallelized versions of typical sequential routines is not a trivial task. Even with the advent of CUDA, NVIDIA's more intuitive solution for GPU programming, developers need to acquire a deep knowledge of GPU architecture and of the rationale of the target algorithms to optimize resource usage and reduce processing time. This paper proposes a systematic method for analyzing parallelized algorithms and proposes guidelines for refactoring CUDA code so that faster software that uses hardware resources more efficiently can be constructed. One such kind of software is Automatic Speech Recognition (ASR) systems. Mainstream approaches to ASR use the Expectation Maximization (EM) algorithm to train Gaussian Mixture Models (GMM) that provide an Acoustic Model for ASR. This training phase is usually extremely time-consuming, so it is well suited to a parallelized solution. We show the feasibility of our method by identifying important issues in a parallelized implementation of EM from the literature and by suggesting refactorings that improve memory occupancy and decrease processing time. The results show a processing speedup of the EM algorithm of between 40x (minimum) and 61x (maximum) compared to the control version. The method was also effective in improving the values of all the relevant performance metrics for GPU-based solutions.

“…Coutinho et al [7] describe what they call "divergence analysis," which is also an extension of the approach of [26]. Their analysis finds divergent values, by first converting SSA information into gated single assignment [22], and then replacing control-flow merges with a predicate select operator.…”

Section: Related Work (mentioning)

confidence: 99%
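The idea in the statement above can be illustrated with a minimal sketch. It is not the algorithm from the cited paper; it assumes the program is already in a gated form in which every control-flow merge has been replaced by a select that lists its predicate as an operand, and it simply propagates divergence from the thread identifier along those dependences. The function name `divergent_values`, the dictionary encoding, and the kernel fragment are hypothetical.

```python
from collections import defaultdict, deque


def divergent_values(defs, seeds):
    """Return every value that may differ between threads.

    defs maps a value name to the names it is computed from. For a gated
    select (the stand-in for a control-flow merge) the operand list also
    contains the predicate, so control dependences are followed like
    ordinary data dependences.
    """
    # Invert the definition map: operand -> values that use it.
    users = defaultdict(set)
    for name, operands in defs.items():
        for operand in operands:
            users[operand].add(name)

    divergent = set(seeds)
    worklist = deque(seeds)
    while worklist:
        value = worklist.popleft()
        for user in users[value]:
            if user not in divergent:
                divergent.add(user)
                worklist.append(user)
    return divergent


# Hypothetical kernel fragment in gated form:
#   p = tid < n
#   a = c * 2
#   b = c + 1
#   x = select(p, a, b)   # replaces the merge after "if (p)"
#   y = c * c
defs = {
    "p": ["tid", "n"],
    "a": ["c"],
    "b": ["c"],
    "x": ["p", "a", "b"],
    "y": ["c"],
}

print(sorted(divergent_values(defs, ["tid"])))  # ['p', 'tid', 'x']
```

Here `x` is reported as divergent only because its predicate `p` depends on `tid`; both merged values `a` and `b` are uniform, which is the case that a purely data-flow view of the original merge would miss.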

“…Collange [5] presents work with goals similar to ours, but uses an approach like that described in [7]. Collange does not use a gated representation but instead performs a symbolic analysis on a lattice of tags, which encodes and tracks alignment of various instruction operands.…”

Section: Related Work (mentioning)

confidence: 99%


Convergence and scalarization for data-parallel architectures

Asanović, Keckler, Lee et al. 2013

Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)


Modern throughput processors such as GPUs achieve high performance and efficiency by exploiting data parallelism in application kernels expressed as threaded code. One drawback of this approach compared to conventional vector architectures is redundant execution of instructions that are common across multiple threads, resulting in energy inefficiency due to excess instruction dispatch, register file accesses, and memory operations. This paper proposes to alleviate these overheads while retaining the threaded programming model by automatically detecting the scalar operations and factoring them out of the parallel code. We have developed a scalarizing compiler that employs convergence and variance analyses to statically identify values and instructions that are invariant across multiple threads. Our compiler algorithms are effective at identifying convergent execution even in programs with arbitrary control flow, identifying two-thirds of the opportunity captured by a dynamic oracle. The compile-time analysis leads to a reduction in instructions dispatched by 29%, register file reads and writes by 31%, memory address counts by 47%, and data access counts by 38%.


“…This constraint results in unique tradeoffs and optimization opportunities for the divergence management architecture. Finally (5), in SPMD programs, uniform control and data operations can be scalarized to improve efficiency, a challenge that is related to but different than vectorization [5,12,15].…”

Section: Introduction (mentioning)

confidence: 99%

Exploring the Design Space of SPMD Divergence Management on Data-Parallel Architectures

Lee, Grover, Krashinsky et al. 2014

2014 47th Annual IEEE/ACM International Symposium on Microarchitecture


Data-parallel architectures must provide efficient support for complex control-flow constructs to support sophisticated applications coded in modern single-program multiple-data languages. As these architectures have wide datapaths that process a single instruction across parallel threads, a mechanism is needed to track and sequence threads as they traverse potentially divergent control paths through the program. The design space for divergence management ranges from software-only approaches where divergence is explicitly managed by the compiler, to hardware solutions where divergence is managed implicitly by the microarchitecture. In this paper, we explore this space and propose a new predication-based approach for handling control-flow structures in data-parallel architectures. Unlike prior predication algorithms, our new compiler analyses and hardware instructions consider the commonality of predication conditions across threads to improve efficiency. We prototype our algorithms in a production compiler and evaluate the tradeoffs between software and hardware divergence management on current GPU silicon. We show that our compiler algorithms make a predication-only architecture competitive in performance to one with hardware support for tracking divergence.
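As a rough illustration of predication that takes the commonality of conditions across threads into account, the toy model below executes an if/else over a group of lanes under a per-lane mask and skips a side entirely when the predicate is the same in every lane. It is a sketch under simplifying assumptions, not the compiler algorithm or hardware mechanism of the paper above; `run_predicated` and its parameters are hypothetical names.

```python
def run_predicated(lanes, cond, then_fn, else_fn):
    """Execute `if cond(x): then_fn(x) else: else_fn(x)` for all lanes.

    Each side of the branch is issued once for the whole group and only
    writes lanes whose predicate selects it. A side is skipped when no
    lane needs it, so a uniform predicate costs a single pass.
    Returns the per-lane results and the number of passes issued.
    """
    mask = [bool(cond(x)) for x in lanes]
    out = list(lanes)
    passes = 0

    if any(mask):  # at least one lane takes the "then" side
        passes += 1
        out = [then_fn(x) if m else o for x, m, o in zip(lanes, mask, out)]

    if not all(mask):  # at least one lane takes the "else" side
        passes += 1
        out = [o if m else else_fn(x) for x, m, o in zip(lanes, mask, out)]

    return out, passes


def is_even(x):
    return x % 2 == 0


# Divergent predicate: both sides are issued.
print(run_predicated([1, 2, 3, 4], is_even, lambda x: x * 10, lambda x: -x))
# ([-1, 20, -3, 40], 2)

# Uniform predicate: the "else" side is never issued.
print(run_predicated([2, 4], is_even, lambda x: x * 10, lambda x: -x))
# ([20, 40], 1)
```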
