NQueens on CUDA: Optimization Issues

Feinbube, Frank; Rabe, Bernhard; Löwis, Martin von; Polze, Andreas

doi:10.1109/ispdc.2010.22

Cited by 18 publications

(38 citation statements)

References 1 publication


“…Image source: [18]. 22,317,699,616,364,044 valid solutions were determined for N = 26. While the world record is held by the aforementioned FPGA-based approach, graphics hardware based implementations have also been well researched [8], [5,23]. However, no publication is known to the authors that tries to solve the unbalanced workload distribution of N-Queens using Dynamic Parallelism.…”

Section: Parallel Implementations (mentioning)

confidence: 99%

“…However, no publication is known to the authors that tries to solve the unbalanced workload distribution of N-Queens using Dynamic Parallelism. The fastest implementation for GPU compute devices known to the authors has been published by Feinbube et al. [8], which is based on Somers' [21] serial implementation. In this approach, the main CPU creates initial board configurations for each thread and then hands over the tasks to the GPU.…”

Section: Parallel Implementations (mentioning)

confidence: 99%

“…The partial results retrieved for each subtree are finally consolidated by the main CPU. The implementation [8] incorporates several GPU-specific optimizations and is also used as a reference for the implementation strategies evaluated in the course of this paper.…”

Section: Parallel Implementations (mentioning)

confidence: 99%
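
The division of labour described in these statements (the CPU creates initial board configurations, each GPU thread searches one of them, and the CPU consolidates the partial results) can be sketched on the host alone. The following is a minimal C++ sketch, not the implementation of [8]: the names `Task`, `make_tasks`, `count_subtree` and `count_solutions` are hypothetical, the bitmask encoding is an assumption, and the loop over tasks merely stands in for the GPU threads. It assumes `1 <= depth <= n` and `n < 32`.

```cpp
#include <cstdint>
#include <vector>

// One unit of work: a partially filled board encoded as bitmasks (assumed encoding).
struct Task {
    std::uint32_t cols;   // columns already occupied
    std::uint32_t diag1;  // diagonal attacks, shifted left by one column per row
    std::uint32_t diag2;  // diagonal attacks, shifted right by one column per row
    int row;              // next row to fill
};

// Host side: collect every valid placement of queens in the first `depth` rows.
void make_tasks(int n, int depth, Task t, std::vector<Task>& out) {
    if (t.row == depth) { out.push_back(t); return; }
    const std::uint32_t full = (1u << n) - 1;
    std::uint32_t avail = full & ~(t.cols | t.diag1 | t.diag2);
    while (avail) {
        const std::uint32_t bit = avail & (0u - avail);  // lowest free column
        avail ^= bit;
        make_tasks(n, depth,
                   Task{t.cols | bit, (t.diag1 | bit) << 1, (t.diag2 | bit) >> 1, t.row + 1},
                   out);
    }
}

// Work of a single thread: count all completions of one initial configuration.
std::uint64_t count_subtree(int n, Task t) {
    if (t.row == n) return 1;
    const std::uint32_t full = (1u << n) - 1;
    std::uint32_t avail = full & ~(t.cols | t.diag1 | t.diag2);
    std::uint64_t total = 0;
    while (avail) {
        const std::uint32_t bit = avail & (0u - avail);
        avail ^= bit;
        total += count_subtree(
            n, Task{t.cols | bit, (t.diag1 | bit) << 1, (t.diag2 | bit) >> 1, t.row + 1});
    }
    return total;
}

// Host side: create the tasks, process them independently, consolidate the results.
std::uint64_t count_solutions(int n, int depth) {
    std::vector<Task> tasks;
    make_tasks(n, depth, Task{0, 0, 0, 0}, tasks);
    std::vector<std::uint64_t> partial(tasks.size());
    for (std::size_t i = 0; i < tasks.size(); ++i)  // stand-in for one GPU thread per task
        partial[i] = count_subtree(n, tasks[i]);
    std::uint64_t total = 0;
    for (std::uint64_t p : partial) total += p;
    return total;
}
```

Because the subtrees below different initial configurations differ widely in size, the per-task running times in such a scheme are unbalanced, which is the workload problem the citing paper addresses.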

“…Finally, the concept DP-SWAP explained in Section 3.4 presents a strategy that is able to reduce the number of dynamic memory allocations in the context of Dynamic Parallelism. In Section 4, all implementations are evaluated and compared to the approaches of Somers [21] and Feinbube et al. [8]. Like these approaches, all strategies presented in the course of this work exploit the symmetry of N-Queens (see Figure 1).…”

Section: Approaches To Dynamic Parallelism (mentioning)

confidence: 99%
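
One common way to exploit the symmetry of N-Queens is to place the first-row queen only in the left half of the board and double the count, handling the middle column separately when N is odd. The C++ sketch below illustrates this mirror trick; whether the cited works use exactly this reflection is an assumption, and `place` and `count_with_mirror` are hypothetical names. It assumes `1 <= n < 32`.

```cpp
#include <cstdint>

// Count completions of a partial board given as bitmasks (assumed encoding).
static std::uint64_t place(std::uint32_t full, std::uint32_t cols,
                           std::uint32_t d1, std::uint32_t d2) {
    if (cols == full) return 1;  // every column holds a queen
    std::uint64_t total = 0;
    std::uint32_t avail = full & ~(cols | d1 | d2);
    while (avail) {
        const std::uint32_t bit = avail & (0u - avail);  // lowest free column
        avail ^= bit;
        total += place(full, cols | bit, (d1 | bit) << 1, (d2 | bit) >> 1);
    }
    return total;
}

// Search only first-row positions in the left half and mirror the result.
std::uint64_t count_with_mirror(int n) {
    const std::uint32_t full = (1u << n) - 1;
    std::uint64_t total = 0;
    for (int c = 0; c < n / 2; ++c) {
        const std::uint32_t bit = 1u << c;
        total += place(full, bit, bit << 1, bit >> 1);
    }
    total *= 2;  // each left-half solution has a mirrored right-half twin
    if (n % 2 == 1) {  // the middle column maps onto itself, so count it once
        const std::uint32_t bit = 1u << (n / 2);
        total += place(full, bit, bit << 1, bit >> 1);
    }
    return total;
}
```

The mirror halves the number of first-row subtrees that have to be searched, at the cost of one extra case for odd board sizes.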

“…Counteracting the overhead of the large number of light-weight threads employed in DP-1, this approach refers to the approach of Feinbube et al. [8] and aims at employing fewer threads and larger work packages. For that purpose, each thread applies backtracking to search a subtree up to a specified search depth.…”

Section: DP-2: Condensed Kernel With Static Search Depth (mentioning)

confidence: 99%
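
The idea of searching a subtree only up to a specified depth can be illustrated with a host-side C++ sketch. This is not the DP-2 kernel from the citing paper: `Node`, `search_to_depth` and `count_in_rounds` are hypothetical names, and the round-by-round loop is an assumed stand-in for launching follow-up work on the GPU. Each call explores at most `budget` further rows; boards that are still incomplete at that point are handed back as new work. It assumes `1 <= n < 32`.

```cpp
#include <cstdint>
#include <vector>

// A node of the search tree: a partial board as bitmasks (assumed encoding).
struct Node {
    std::uint32_t cols, d1, d2;  // occupied columns and the two diagonal attack masks
    int row;                     // next row to fill
};

// Backtrack at most `budget` rows below `node`. Complete boards are counted;
// nodes at which the budget runs out are appended to `frontier`.
std::uint64_t search_to_depth(int n, Node node, int budget, std::vector<Node>& frontier) {
    if (node.row == n) return 1;
    if (budget == 0) { frontier.push_back(node); return 0; }
    const std::uint32_t full = (1u << n) - 1;
    std::uint32_t avail = full & ~(node.cols | node.d1 | node.d2);
    std::uint64_t total = 0;
    while (avail) {
        const std::uint32_t bit = avail & (0u - avail);  // lowest free column
        avail ^= bit;
        total += search_to_depth(
            n, Node{node.cols | bit, (node.d1 | bit) << 1, (node.d2 | bit) >> 1, node.row + 1},
            budget - 1, frontier);
    }
    return total;
}

// Process the tree in rounds; each round's frontier becomes the next round's work list.
std::uint64_t count_in_rounds(int n, int budget) {
    if (budget < 1) budget = 1;  // a zero budget would never make progress
    std::vector<Node> current{Node{0, 0, 0, 0}};
    std::uint64_t total = 0;
    while (!current.empty()) {
        std::vector<Node> next;
        for (const Node& node : current)  // stand-in for one thread per node
            total += search_to_depth(n, node, budget, next);
        current.swap(next);
    }
    return total;
}
```

A larger `budget` means fewer, larger work packages per round; a smaller one means more rounds with many light-weight work items, which is the trade-off the statement above refers to.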


A Performance Evaluation of Dynamic Parallelism for Fine-Grained, Irregular Workloads

Plauth, Feinbube, Schlegel, et al. 2016. IJNC. (Self cite)


GPU compute devices have become very popular for general purpose computations. However, the SIMD-like hardware of graphics processors is currently not well suited for irregular workloads, like searching unbalanced trees. In order to mitigate this drawback, NVIDIA introduced an extension to GPU programming models called Dynamic Parallelism. This extension enables GPU programs to spawn new units of work directly on the GPU, allowing the refinement of subsequent work items based on intermediate results without any involvement of the main CPU. This work investigates methods for employing Dynamic Parallelism with the goal of improved workload distribution for tree search algorithms on modern GPU hardware. For the evaluation of the proposed approaches, a case study is conducted on the N-Queens problem. Extensive benchmarks indicate that the benefits of improved resource utilization fail to outweigh high management overhead and runtime limitations due to the very fine level of granularity of the investigated problem. However, novel memory management concepts for passing parameters to child grids are presented. These general concepts are applicable to other, more coarse-grained problems that benefit from the use of Dynamic Parallelism.


IVM‐based parallel branch‐and‐bound using hierarchical work stealing on multi‐GPU systems

Gmys, Mezmaz, Melab, et al. 2016. Concurrency and Computation.


Tree-based exploratory methods, like Branch-and-Bound (B&B) algorithms, are highly irregular applications, which makes their design and implementation on graphics processing units (GPUs) challenging. In this paper, we present a multi-GPU B&B algorithm for solving large permutation-based combinatorial optimization problems. To tackle the problem of the irregular workload, we propose a hierarchical work stealing (WS) strategy that balances the workload inside the GPU and between different GPUs and CPU cores. Our B&B is based on an Integer-Vector-Matrix data structure instead of a pool of permutations, and the work units exchanged are intervals of factoradics instead of sets of nodes. Two variants of the algorithm, using the same hierarchical WS strategy, are proposed: one for combinatorial optimization problems where the evaluation of nodes is costly and one for fine-grained problems. The latter variant uses a new hypercube-based WS strategy and a trigger mechanism to balance the workload inside the GPU. The proposed approach has been extensively tested using the flowshop scheduling, the n-queens and the asymmetric travelling salesman problems as test cases. The reported results show that the proposed hierarchical WS mechanism is capable of handling fine and coarse-grained types of workloads efficiently, reaching near-linear speed-up on up to four GPUs for a set of ten flowshop instances and large instances of fine-grained problem
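
To make "intervals of factoradics" concrete: every permutation of n items corresponds to exactly one number in [0, n!) written in the factorial number system, so a contiguous range of such numbers describes a contiguous block of the permutation tree. The C++ sketch below only illustrates this general idea; it is not the IVM data structure or the work stealing protocol of the paper, and `to_factoradic`, `to_permutation`, `Interval` and `steal_half` are hypothetical names.

```cpp
#include <cstdint>
#include <vector>

// Write `index` (0 <= index < n!) in the factorial number system.
// digits[i] lies in [0, n-1-i]; the last digit is always 0.
std::vector<int> to_factoradic(std::uint64_t index, int n) {
    std::vector<int> digits(n, 0);
    for (int radix = 1; radix <= n; ++radix) {
        digits[n - radix] = static_cast<int>(index % radix);
        index /= radix;
    }
    return digits;
}

// Decode factoradic digits into a permutation: digit i selects the
// digits[i]-th item among those not used so far.
std::vector<int> to_permutation(const std::vector<int>& digits) {
    std::vector<int> pool(digits.size());
    for (std::size_t i = 0; i < pool.size(); ++i) pool[i] = static_cast<int>(i);
    std::vector<int> perm;
    for (int d : digits) {
        perm.push_back(pool[d]);
        pool.erase(pool.begin() + d);
    }
    return perm;
}

// A work unit: the half-open range [begin, end) of permutation indices.
struct Interval {
    std::uint64_t begin, end;
};

// Split off the upper half of the victim's interval (assumed stealing policy).
Interval steal_half(Interval& victim) {
    const std::uint64_t mid = victim.begin + (victim.end - victim.begin) / 2;
    Interval stolen{mid, victim.end};
    victim.end = mid;
    return stolen;
}
```

Exchanging two numbers instead of a set of nodes keeps the transferred work description at constant size, regardless of how many nodes the interval covers.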


GPU‐accelerated backtracking using CUDA Dynamic Parallelism

Pessoa, Gmys, Júnior, et al. 2017. Concurrency and Computation.


New GPGPU technologies, such as CUDA Dynamic Parallelism (CDP), can help deal with recursive patterns of computation, such as divide‐and‐conquer, used by backtracking algorithms. In this paper, we propose a GPU‐accelerated backtracking algorithm using CDP that extends a well‐known parallel backtracking model. The search starts on the CPU, processing the search tree until a first cutoff depth. Based on this partial backtracking tree, the algorithm analyzes the memory requirements of subsequent kernel generations. The proposed algorithm performs no dynamic allocation of memory on the GPU, unlike related works from the literature. The proposed algorithm has been extensively tested using the N‐Queens Puzzle problem and instances of the Asymmetric Traveling Salesman Problem (ATSP) as test cases. The proposed CDP algorithm may, under some conditions, outperform its non‐CDP counterpart by a factor of up to 25, but it may also be up to two times slower. The CDP‐based implementation has much better worst‐case execution times and makes the algorithm's performance less dependent on the tuning of parameters. Compared to other CDP‐based strategies from the literature, the proposed algorithm is on average 8× faster. The proposed algorithm is also hybridized with another CDP‐based strategy from the literature. The combination of strategies is on average 4.5× faster than the related strategy. We also identify some difficulties, limitations, and bottlenecks concerning the CDP programming model, which may be useful for potential users.
