The Unified Memory Machine (UMM) is a theoretical parallel computing model that captures the essence of global memory access on GPUs. A sequential algorithm is oblivious if the address accessed at each time step does not depend on the input data. Many important tasks, including matrix computation, signal processing, sorting, dynamic programming, and encryption/decryption, can be performed by oblivious sequential algorithms. Bulk execution of a sequential algorithm means executing it for many different inputs, either in turn or at the same time. The main contribution of this paper is to show that the bulk execution of an oblivious sequential algorithm can be implemented to run very efficiently on the UMM. More specifically, the bulk execution for p different inputs can be implemented to run in O(pt/w + lt) time units using p threads on the UMM with memory width w and memory access latency l, where t is the running time of the oblivious sequential algorithm. We also prove that this implementation is time optimal. Further, we have implemented two oblivious sequential algorithms: one computes the prefix-sums of an array of size n, and the other finds the optimal triangulation of a convex n-gon using the dynamic programming technique. The prefix-sums algorithm is a quite simple example of an oblivious algorithm, while the optimal triangulation algorithm is rather complicated. Experimental results on a GeForce GTX Titan show that our implementations of the bulk execution of these two algorithms can be 150 times faster than a single CPU when there are many inputs. This implies that bulk execution of oblivious sequential algorithms is a potent method to elicit the capability of CUDA-enabled GPUs very easily.
Summary
Several important tasks, including matrix computation, signal processing, sorting, dynamic programming, encryption, and decryption, can be performed by oblivious sequential algorithms. A sequential algorithm is oblivious if the address accessed at each time step does not depend on the input data. Bulk execution of a sequential algorithm means executing it for many independent inputs, either in turn or in parallel. A number of works have been devoted to designing and implementing parallel algorithms for a single input; however, none of them evaluated the bulk execution performance of these algorithms. The first contribution of this paper is to present a time-optimal implementation for bulk execution of an oblivious sequential algorithm. Our second contribution is to develop a tool, named C2CU, which automatically generates a CUDA C program for the bulk execution of an oblivious sequential algorithm. C2CU has been used to generate CUDA C programs for the bulk execution of the bitonic sorting, Floyd-Warshall, and Montgomery modulo multiplication algorithms. Compared to a sequential implementation on a single CPU, the generated CUDA C programs for the above algorithms run, respectively, 199, 54, and 78 times faster.