A parallel two-list algorithm for the knapsack problem

Lou, Der‐Chyuan; Chang, Chin‐Chen

doi:10.1016/s0167-8191(96)00085-3

Cited by 27 publications

(16 citation statements)

References 12 publications


“…Thus, the following two lemmas are introduced. Here, we do not provide the proof of these two lemmas, because the proof of these two lemmas is similar to that of Lemmas1 and 2 in [12].…”

Section: 22 (mentioning)

confidence: 99%

A Novel CPU-GPU Cooperative Implementation of A Parallel Two-List Algorithm for the Subset-Sum Problem

Wan, Liu, et al. 2014

Proceedings of Programming Models and Applications on Multicores and Manycores


The subset-sum problem is a well-known non-deterministic polynomial-time complete (NP-complete) decision problem. This paper proposes a novel and efficient implementation of a parallel two-list algorithm for solving the problem on a graphics processing unit (GPU) using Compute Unified Device Architecture (CUDA). The algorithm is composed of a generation stage, a pruning stage, and a search stage. It is not easy to effectively implement the three stages of the algorithm on a GPU. Ways to achieve better performance, reasonable task distribution between CPU and GPU, effective GPU memory management, and CPU-GPU communication cost minimization are discussed. The generation stage of the algorithm adopts a typical recursive divide-and-conquer strategy. Because recursion cannot be well supported by current GPUs with compute capability less than 3.5, a new vector-based iterative implementation mechanism is designed to replace the explicit recursion. Furthermore, to optimize the performance of the GPU implementation, this paper improves the three stages of the algorithm. The experimental results show that the GPU implementation has much better performance than the CPU implementation and can achieve high speedup on different GPU cards. The experimental results also illustrate that the improved algorithm can bring significant performance benefits for the GPU implementation.

…well-known approach is the dynamic programming algorithm [7], which solves SSP in pseudopolynomial time, but it has exponential time complexity when the knapsack capacity is large. A tremendous improvement was made by Horowitz and Sahni [6], who developed a new technique that solves SSP in time $O(n 2^{n/2})$ with $O(2^{n/2})$ memory space. The new technique is known as the two-list algorithm. On the basis of the two-list algorithm, Schroeppel and Shamir [8] proposed the two-list four-table algorithm, which needs the same time $O(n 2^{n/2})$ and less memory space $O(2^{n/4})$ to solve SSP.

Although many sequential algorithms have been designed to solve SSP in the past, Horowitz and Sahni's two-list algorithm continues to be the best known sequential algorithm.

With the advent of parallel computing, a large effort has been made to reduce the computation time of SSP. Karnin [9] proposed a parallel algorithm that parallelizes the generation routine of the two-list four-table algorithm [8] using $O(2^{n/6})$ processors and $O(2^{n/6})$ memory cells in time $O(n 2^{n/2})$. Ferreira [10] presented a brilliant parallel two-list algorithm that solves SSP in time $O(n (2^{n/2})^{\varepsilon})$ with $O((2^{n/2})^{1-\varepsilon})$ processors and $O(2^{n/2})$ memory space, where $0 \le \varepsilon \le 1$. Chang et al. [11] introduced a parallel algorithm that parallelizes the generation stage of Horowitz and Sahni's two-list algorithm [6]. They claimed that their parallel generation stage can be accomplished in time $O((n/8)^2)$ with $O(2^{n/8})$ processors and $O(2^{n/4})$ memory space. On the basis of the generation technique of Chang et al., Lou and Chang [12] successfully parallelized the search stage of Horowitz and Sahni's two-list algorit...
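As a rough illustration of the sequential two-list idea described above, the following Python sketch splits the items into two halves, generates every subset sum of each half, and then scans one list upward and the other downward. It is a minimal sketch under stated assumptions, not the implementation from any of the cited papers: the function names (`subset_sums`, `two_list_subset_sum`) are hypothetical, the lists are sorted with a library sort rather than built by merging, and no pruning stage is included.

```python
from typing import List, Optional, Tuple


def subset_sums(weights: List[int]) -> List[Tuple[int, int]]:
    """Return (sum, bitmask) for every subset of `weights`, sorted by sum."""
    sums = [(0, 0)]
    for i, w in enumerate(weights):
        # Each item doubles the list: every earlier subset, with the item added.
        sums += [(s + w, mask | (1 << i)) for s, mask in sums]
    sums.sort()  # assumption: plain sort instead of merge-based generation
    return sums


def two_list_subset_sum(weights: List[int], target: int) -> Optional[List[int]]:
    """Return indices of a subset summing to `target`, or None if none exists."""
    half = len(weights) // 2
    left = subset_sums(weights[:half])    # scanned in ascending order
    right = subset_sums(weights[half:])   # scanned in descending order
    i, j = 0, len(right) - 1
    while i < len(left) and j >= 0:
        total = left[i][0] + right[j][0]
        if total == target:
            lmask, rmask = left[i][1], right[j][1]
            chosen = [k for k in range(half) if (lmask >> k) & 1]
            chosen += [half + k
                       for k in range(len(weights) - half)
                       if (rmask >> k) & 1]
            return chosen
        if total < target:
            i += 1   # need a larger sum: advance in the ascending list
        else:
            j -= 1   # need a smaller sum: step back in the descending scan
    return None


if __name__ == "__main__":
    items = [3, 34, 4, 12, 5, 2]
    print(two_list_subset_sum(items, 9))
```

Each list holds $2^{n/2}$ sums, which is where the $O(2^{n/2})$ memory bound quoted above comes from; the search itself takes at most one pass over both lists.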


“…They claimed that their parallel generation stage can be accomplished in time $O((n/8)^2)$ with $O(2^{n/8})$ processors and $O(2^{n/4})$ memory space. On the basis of the generation technique of Chang et al., Lou and Chang successfully parallelized the search stage of Horowitz and Sahni's two-list algorithm using $O(2^{n/8})$ processors and $O(2^{n/4})$ memory space in time $O(2^{3n/8})$. Unfortunately, the analysis of the time complexity of the algorithm of Chang et al., was proved to be wrong by Sanches et al.…”

Section: Introduction (mentioning)

confidence: 99%

GPU implementation of a parallel two‐list algorithm for the subset‐sum problem

Wan, Liu, et al. 2014

Concurrency and Computation


SUMMARY: The subset‐sum problem is a well‐known non‐deterministic polynomial‐time complete (NP‐complete) decision problem. This paper proposes a novel and efficient implementation of a parallel two‐list algorithm for solving the problem on a graphics processing unit (GPU) using Compute Unified Device Architecture (CUDA). The algorithm is composed of a generation stage, a pruning stage, and a search stage. It is not easy to effectively implement the three stages of the algorithm on a GPU. Ways to achieve better performance, reasonable task distribution between CPU and GPU, effective GPU memory management, and CPU–GPU communication cost minimization are discussed. The generation stage of the algorithm adopts a typical recursive divide‐and‐conquer strategy. Because recursion cannot be well supported by current GPUs with compute capability less than 3.5, a new vector‐based iterative implementation mechanism is designed to replace the explicit recursion. Furthermore, to optimize the performance of the GPU implementation, this paper improves the three stages of the algorithm. The experimental results show that the GPU implementation has much better performance than the CPU implementation and can achieve high speedup on different GPU cards. The experimental results also illustrate that the improved algorithm can bring significant performance benefits for the GPU implementation. Copyright © 2014 John Wiley & Sons, Ltd.
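The summary mentions replacing a recursive generation stage with an iterative one. The paper's vector-based mechanism is not described here, so the Python sketch below only illustrates the general idea of building a sorted list of subset sums without recursion: each step merges the current sorted list with a copy shifted by the next item. The function name `sorted_subset_sums` is hypothetical, and the code runs on the CPU rather than a GPU.

```python
from heapq import merge
from typing import List


def sorted_subset_sums(weights: List[int]) -> List[int]:
    """Iteratively build all subset sums of `weights` in ascending order."""
    sums = [0]  # the empty subset
    for w in weights:
        # Adding a constant keeps the list sorted, so a linear merge suffices.
        shifted = [s + w for s in sums]
        sums = list(merge(sums, shifted))
    return sums


if __name__ == "__main__":
    print(sorted_subset_sums([1, 2, 4]))
```

Because every pass is a flat loop over an array, this form avoids a call stack entirely, which is the property the summary says matters on GPUs with limited recursion support.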


“…see [2], [3] and [4]). In particular, implementations on a SIMD machine have been performed on a 4K processor ICL DAP [5], a 16K Connection Machine CM-2 (see [6] and [7]) and a 4K MasPar MP-1 machine (see [7]).…”

Section: Introduction (mentioning)

confidence: 99%