2011 IEEE 17th International Conference on Parallel and Distributed Systems
DOI: 10.1109/icpads.2011.92
Optimizing Dynamic Programming on Graphics Processing Units via Adaptive Thread-Level Parallelism

Cited by 17 publications (6 citation statements)
References 16 publications
“…(2) and selecting the appropriate computing power for each phase lead to better mapping for the entire NPDP computation and maximum utilization of SMs in all the phases. Compared with the earlier state of the art by Wu et al. [19], whose maximum achieved speedup was 13.40×, the speedup using the GMM approach is more than 30×.…”
Section: Results
Confidence: 62%
“…Along the same lines, the inherent non-uniformity of NPDP algorithms is targeted using a thread-block analogy by Wu et al. [19]: a single block is employed for each subproblem, and the number of threads in a block equals the number of comparisons required to compute that subproblem. In addition, a two-stage adaptive thread model for efficient mapping is illustrated by employing a different number of threads in different phases.…”
Section: MCM
Confidence: 99%
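The phase-wise mapping described in the statement above can be illustrated with matrix-chain multiplication (MCM), a standard NPDP example. The sketch below is a hypothetical CPU-side illustration, not code from the cited paper: on a GPU, each subproblem in a phase would map to one thread block and each split point to one thread; here the same independence structure is shown with plain loops.

```python
# Hypothetical sketch of phase-wise NPDP structure, using matrix-chain
# multiplication. Subproblems within one phase depend only on results
# from earlier phases, so they are mutually independent -- this is what
# lets a GPU assign one thread block per subproblem, with one thread per
# split point k (the number of comparisons grows with the phase index).

def mcm_cost(p):
    """p: dimension list; matrix i has shape p[i] x p[i+1]."""
    n = len(p) - 1                       # number of matrices in the chain
    m = [[0] * n for _ in range(n)]      # m[i][j]: min cost for chain i..j
    for length in range(1, n):           # phase: chain length - 1
        for i in range(n - length):      # one "thread block" per subproblem
            j = i + length
            # Each split point k is one "thread"; phase `length` needs
            # `length` comparisons per subproblem.
            m[i][j] = min(
                m[i][k] + m[k + 1][j] + p[i] * p[k + 1] * p[j + 1]
                for k in range(i, j)
            )
    return m[0][n - 1]
```

For example, `mcm_cost([10, 20, 30, 40])` compares the two parenthesizations (AB)C and A(BC) in its final phase and returns the cheaper one.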
“…There are several published works on the implementation of dynamic programming [5], [6], [20], [21], [22]. Their implementations have been optimized mainly based on the developers' experience.…”
Section: Introduction
Confidence: 99%
“…We additionally retrieve the backtrace on the GPU and transfer it to the CPU. The Nussinov and matrix multiplication problems have also been studied as pure GPU implementations [3,38].…”
Section: Metaprogramming and Compiler Technology
Confidence: 99%