The subset-sum problem (SSP) is a well-known non-deterministic polynomial-time complete (NP-complete) decision problem. This paper proposes a novel and efficient implementation of a parallel two-list algorithm for solving the problem on a graphics processing unit (GPU) using Compute Unified Device Architecture (CUDA). The algorithm is composed of a generation stage, a pruning stage, and a search stage. Implementing these three stages effectively on a GPU is not easy. Ways to achieve better performance, including reasonable task distribution between CPU and GPU, effective GPU memory management, and minimization of CPU-GPU communication cost, are discussed. The generation stage of the algorithm adopts a typical recursive divide-and-conquer strategy. Because recursion is not well supported by current GPUs with compute capability less than 3.5, a new vector-based iterative implementation mechanism is designed to replace the explicit recursion. Furthermore, to optimize the performance of the GPU implementation, this paper improves the three stages of the algorithm. The experimental results show that the GPU implementation performs much better than the CPU implementation and achieves high speedup on different GPU cards. The experimental results also illustrate that the improved algorithm brings significant performance benefits for the GPU implementation.

A well-known approach is the dynamic programming algorithm [7], which solves SSP in pseudopolynomial time, but it has exponential time complexity when the knapsack capacity is large. A tremendous improvement was made by Horowitz and Sahni [6], who developed a new technique that solves SSP in time $O(n2^{n/2})$ with $O(2^{n/2})$ memory space. The new technique is known as the two-list algorithm. On the basis of the two-list algorithm, Schroeppel and Shamir [8] proposed the two-list four-table algorithm, which needs the same time $O(n2^{n/2})$ and less memory space $O(2^{n/4})$ to solve SSP.
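To make the two-list idea concrete, the following is a minimal sequential C++ sketch of it, not the GPU implementation proposed in this paper. It splits the items into two halves, generates the sorted subset sums of each half, and then searches the two lists from opposite ends. The generation step is written as an iterative doubling-and-merging loop to illustrate how explicit recursion can be avoided. The function names `subset_sums_sorted` and `two_list_subset_sum` are hypothetical, the pruning stage is omitted, and the sketch assumes that all sums fit in a 64-bit signed integer.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: returns all 2^k subset sums of `items` in
// nondecreasing order. Each pass merges the current sorted list with a
// copy shifted by the next item, so no recursion is required.
std::vector<std::int64_t> subset_sums_sorted(const std::vector<std::int64_t>& items) {
    std::vector<std::int64_t> sums{0};  // sum of the empty subset
    for (std::int64_t w : items) {
        // Adding a constant preserves order, so `shifted` is also sorted.
        std::vector<std::int64_t> shifted(sums.size());
        std::transform(sums.begin(), sums.end(), shifted.begin(),
                       [w](std::int64_t s) { return s + w; });
        std::vector<std::int64_t> merged(sums.size() * 2);
        std::merge(sums.begin(), sums.end(),
                   shifted.begin(), shifted.end(), merged.begin());
        sums.swap(merged);
    }
    return sums;
}

// Hypothetical driver: decides whether some subset of `weights` sums to
// `target` using the generate-then-search structure of the two-list method.
bool two_list_subset_sum(const std::vector<std::int64_t>& weights, std::int64_t target) {
    const auto half = static_cast<std::ptrdiff_t>(weights.size() / 2);
    const std::vector<std::int64_t> left(weights.begin(), weights.begin() + half);
    const std::vector<std::int64_t> right(weights.begin() + half, weights.end());

    // Generation stage: one sorted list of subset sums per half.
    const std::vector<std::int64_t> a = subset_sums_sorted(left);
    const std::vector<std::int64_t> b = subset_sums_sorted(right);

    // Search stage: walk `a` upward and `b` downward.
    std::size_t i = 0;         // current element of a is a[i]
    std::size_t j = b.size();  // current element of b is b[j - 1]
    while (i < a.size() && j > 0) {
        const std::int64_t s = a[i] + b[j - 1];
        if (s == target) return true;
        if (s < target) {
            ++i;  // sum too small: move to a larger element of a
        } else {
            --j;  // sum too large: move to a smaller element of b
        }
    }
    return false;
}
```

For example, calling `two_list_subset_sum({3, 34, 4, 12, 5, 2}, 9)` returns `true`, because 4 + 5 = 9. Each list holds at most $2^{\lceil n/2 \rceil}$ sums, which is the source of the $O(2^{n/2})$ memory requirement cited above.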
Although many sequential algorithms have been designed to solve SSP in the past, Horowitz and Sahni's two-list algorithm continues to be the best known sequential algorithm.

With the advent of parallel computing, a large effort has been made to reduce the computation time of SSP. Karnin [9] proposed a parallel algorithm that parallelizes the generation routine of the two-list four-table algorithm [8] using $O(2^{n/6})$ processors and $O(2^{n/6})$ memory cells in time $O(n2^{n/2})$. Ferreira [10] presented a brilliant parallel two-list algorithm that solves SSP in time $O(n(2^{n/2})^{\varepsilon})$ with $O((2^{n/2})^{1-\varepsilon})$ processors and $O(2^{n/2})$ memory space, where $0 \le \varepsilon \le 1$. Chang et al. [11] introduced a parallel algorithm that parallelizes the generation stage of Horowitz and Sahni's two-list algorithm [6]. They claimed that their parallel generation stage can be accomplished in time $O((n/8)^2)$ with $O(2^{n/8})$ processors and $O(2^{n/4})$ memory space. On the basis of the generation technique of Chang et al., Lou and Chang [12] successfully parallelized the search stage of Horowitz and Sahni's two-list algorit...