C2CU: a CUDA C program generator for bulk execution of a sequential algorithm

Concurrency and Computation

2018

Self Cite

Summary: Bulk execution means executing the same computation for many different inputs, either in turn or at the same time. The main contribution of this paper is a parallel processing technique for the bulk execution of dynamic programming on the GPU (Graphics Processing Unit). In particular, we focus on the optimal polygon triangulation problem for a large number of polygons. We consider GPU programming issues such as coalesced access to global memory, avoidance of warp divergence, and reduction of CUDA kernel calls. In the GPU implementation, we propose two thread assignment methods that efficiently execute many threads in parallel on the thousands of cores in the GPU. The experimental results show that our GPU implementation on an NVIDIA TITAN V attains speed-up factors of up to 106.05 and 26.78 over the single-thread and 8-thread CPU implementations on an Intel Core i7-6700K CPU, respectively.
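To make the idea of bulk execution concrete, here is a minimal Python sketch of a triangulation dynamic program run over a batch of polygons. It is not the paper's GPU implementation. The cost function (product of the three vertex weights of each triangle), the name `bulk_triangulate`, and the array layout are all assumptions chosen for illustration. The point it shows is that the loop indices `i`, `j`, `k` do not depend on the input, so every instance performs the same step at the same time; on a GPU the innermost loop over `t` would become one thread per polygon, and keeping `t` as the last index places the same table cell of all instances next to each other, which is the kind of layout that favors coalesced access.

```python
INF = float("inf")

def bulk_triangulate(polygons):
    """Minimum-cost triangulation for a batch of convex polygons.

    polygons: list of equal-length lists of vertex weights (hypothetical input
    format). Assumed cost of a triangle (i, k, j): v[i] * v[k] * v[j].
    """
    p = len(polygons)
    n = len(polygons[0])
    assert all(len(v) == n for v in polygons), "all instances must have the same size"

    # dp[i][j][t]: best cost of the sub-polygon i..j for instance t.
    # The instance index t is innermost, so cell (i, j) of all instances is contiguous.
    dp = [[[0] * p for _ in range(n)] for _ in range(n)]

    for length in range(2, n):               # input-independent loop structure
        for i in range(n - length):
            j = i + length
            for t in range(p):
                dp[i][j][t] = INF
            for k in range(i + 1, j):
                for t in range(p):           # on a GPU: one thread per instance t
                    v = polygons[t]
                    cand = dp[i][k][t] + dp[k][j][t] + v[i] * v[k] * v[j]
                    if cand < dp[i][j][t]:
                        dp[i][j][t] = cand

    return [dp[0][n - 1][t] for t in range(p)]
```

For example, `bulk_triangulate([[3, 7, 4, 5], [1, 1, 1, 1]])` evaluates both quadrilaterals in one pass over the shared loop structure.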

Section: Discussion (mentioning)

confidence: 99%

Bulk execution of the dynamic programming for the optimal polygon triangulation problem on the GPU

Yamashita

Concurrency and Computation

2018

Self Cite

“…We have implemented the single thread implementation such that each thread computes one multiplication. This implementation is based on the idea proposed in [16]. In the implementation, there is no warp divergence since all threads execute the same instructions, that is, this implementation is also based on warp-synchronous programming technique.…”
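The statement above describes assigning one multiplication to each thread, with all threads executing the same instructions. A rough Python sketch of that pattern follows. It is an illustration under assumptions, not the cited implementation: the 32-bit word size, the schoolbook method, and the helper names `to_words`, `from_words`, `mul_words`, and `bulk_multiply` are all hypothetical. The inner routine has fixed loop bounds and no data-dependent branches, which is the property that lets every instance follow an identical instruction sequence.

```python
WORD_BITS = 32                      # assumed word size
MASK = (1 << WORD_BITS) - 1

def to_words(x, n):
    """Split a non-negative integer into n little-endian 32-bit words."""
    return [(x >> (WORD_BITS * i)) & MASK for i in range(n)]

def from_words(words):
    """Reassemble an integer from little-endian 32-bit words."""
    return sum(w << (WORD_BITS * i) for i, w in enumerate(words))

def mul_words(a, b):
    """Schoolbook product of two n-word numbers, returning 2n words.

    Loop bounds depend only on n, never on the operand values, so every
    instance in a batch executes exactly the same sequence of operations.
    """
    n = len(a)
    out = [0] * (2 * n)
    for i in range(n):
        carry = 0
        for j in range(n):
            t = out[i + j] + a[i] * b[j] + carry   # fits in 64 bits
            out[i + j] = t & MASK
            carry = t >> WORD_BITS
        out[i + n] = carry
    return out

def bulk_multiply(pairs, n):
    """Multiply every (x, y) pair; on a GPU each pair would map to one thread."""
    return [from_words(mul_words(to_words(x, n), to_words(y, n))) for x, y in pairs]
```

With `n = 32` words, each operand is a 1024-bit number, so `bulk_multiply([(x, y), ...], 32)` corresponds to a batch of 1024-bit multiplications.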

Section: Results (mentioning)

confidence: 99%

“…However, there is no research that is premised on the bulk execution. On the other hand, to accelerate the bulk execution of multiple-length multiplication, our proposed method uses the idea in [16] that shows a technique of more efficient GPU implementations for the bulk execution by considering the GPU architecture.…”

Section: Introduction (mentioning)

confidence: 99%

GPU-Accelerated Bulk Execution of Multiple-Length Multiplication with Warp-Synchronous Programming Technique

Honda

IEICE Trans. Inf. & Syst.

2016

Self Cite

SUMMARY In this paper, we present a GPU implementation of bulk multiple-length multiplications. The idea of our GPU implementation is to adopt a warp-synchronous programming technique. We assign each multiple-length multiplication to one warp that consists of 32 threads. In parallel processing using multiple threads, usually, it is costly to synchronize execution of threads and communicate within threads. In warp-synchronous programming technique, however, execution of threads in a warp can be synchronized instruction by instruction without any barrier synchronous operations. Also, inter-thread communication can be performed by warp shuffle functions without accessing shared memory. The experimental results show that our GPU implementation on NVIDIA GeForce GTX 980 attains a speed-up factor of 52 for 1024-bit multiple-length multiplication over the sequential CPU implementation. Moreover, we use this 1024-bit multiple-length multiplication for larger size of bits as a sub-routine. The GPU implementation attains a speed-up factor of 21 for 65536-bit multiple-length multiplication.
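The abstract above relies on warp shuffle functions for communication between the 32 threads of a warp. The following Python sketch simulates that pattern on the CPU so the data movement can be followed step by step. It is an illustration only: `shfl_down` is a hypothetical stand-in that imitates the behavior of a shuffle-down operation (each lane reads the register of the lane `delta` positions above it, and keeps its own value when that lane does not exist), and the sum reduction is a generic use of the idea, not the multiplication algorithm of the paper.

```python
WARP_SIZE = 32

def shfl_down(values, delta):
    """Simulated shuffle-down: lane i receives the value held by lane i + delta.

    Lanes whose source would fall outside the warp keep their own value.
    """
    return [values[i + delta] if i + delta < WARP_SIZE else values[i]
            for i in range(WARP_SIZE)]

def warp_reduce_sum(values):
    """Sum 32 per-lane values in log2(32) = 5 shuffle steps.

    All lanes execute the same step together, so no barrier and no shared
    memory are needed in this model. The total ends up in lane 0.
    """
    vals = list(values)
    assert len(vals) == WARP_SIZE
    offset = WARP_SIZE // 2
    while offset > 0:
        shifted = shfl_down(vals, offset)           # every lane reads a neighbor
        vals = [v + s for v, s in zip(vals, shifted)]
        offset //= 2
    return vals[0]
```

Calling `warp_reduce_sum(range(32))` walks through offsets 16, 8, 4, 2, 1 and leaves the sum of all lanes in lane 0.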

“…CUDA gives developers access to the virtual instruction set and memory of the parallel computational elements in NVIDIA GPUs. In many cases, GPUs are more efficient than multicore processors, since they have hundreds of processor cores and very high memory bandwidth [3-5] …”

Section: Introduction (mentioning)

confidence: 99%

Efficient parallel implementations to compute the diameter of a graph

Takafuji

Concurrency and Computation

2020

Self Cite

The Floyd-Warshall algorithm is a well-known algorithm for computing the distances between all pairs of nodes of a graph. The Blocked Floyd-Warshall algorithm, a variant of the Floyd-Warshall algorithm, has been proposed to accelerate it on a graphics processing unit (GPU) architecture. The previously published GPU implementations of the Blocked Floyd-Warshall algorithm perform many separate kernel calls for costly barrier synchronization. The main contribution of this article is to present efficient implementations of the Blocked Floyd-Warshall algorithm that perform no barrier synchronization and invoke only one kernel call. Experimental results using an NVIDIA Tesla V100 show that our implementation runs 1.05-1.31 times faster than the previously published one. Our implementation with SIMD functions also runs 1.00-1.28 times faster than the previously published one. Second, we propose efficient GPU implementations that execute the Blocked Floyd-Warshall algorithm for many graphs at the same time. In the experimental results, our single-kernel implementation runs 1.03-1.60 times faster than the multiple-kernel one. Among the implementations with SIMD functions, our single-kernel implementation runs 1.01-1.89 times faster than the multiple-kernel one. We also propose low-latency implementations for many graphs. Finally, we implemented the parallel Floyd-Warshall algorithm on multicore processors.
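As a reference point for the abstract above, here is a small sequential Python sketch of the Blocked Floyd-Warshall scheme and of a diameter computation built on it. It follows the commonly described three-phase structure (diagonal block, then the blocks in the same row and column, then all remaining blocks) and is not the single-kernel GPU implementation of the article. The function names, the block size of 2, and the choice to ignore unreachable pairs when taking the diameter are assumptions made for this example.

```python
INF = float("inf")

def edges_to_matrix(n, edges):
    """Build a symmetric distance matrix from (u, v, weight) triples."""
    d = [[0 if i == j else INF for j in range(n)] for i in range(n)]
    for u, v, w in edges:
        d[u][v] = min(d[u][v], w)
        d[v][u] = min(d[v][u], w)
    return d

def relax_block(d, rows, cols, ks):
    """Relax d[i][j] through every intermediate node k in ks."""
    for k in ks:
        for i in rows:
            for j in cols:
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]

def blocked_floyd_warshall(dist, block=2):
    """All-pairs shortest distances using the three-phase blocked scheme."""
    n = len(dist)
    d = [row[:] for row in dist]
    blocks = [range(s, min(s + block, n)) for s in range(0, n, block)]
    for r, kb in enumerate(blocks):
        relax_block(d, kb, kb, kb)                  # phase 1: diagonal block
        for b, other in enumerate(blocks):          # phase 2: same row and column
            if b != r:
                relax_block(d, kb, other, kb)
                relax_block(d, other, kb, kb)
        for bi, rows in enumerate(blocks):          # phase 3: all remaining blocks
            if bi == r:
                continue
            for bj, cols in enumerate(blocks):
                if bj != r:
                    relax_block(d, rows, cols, kb)
    return d

def diameter(dist, block=2):
    """Largest finite shortest-path distance (unreachable pairs are ignored)."""
    d = blocked_floyd_warshall(dist, block)
    return max(x for row in d for x in row if x != INF)
```

On a GPU, the blocks within phases 2 and 3 are independent of one another, which is where the parallelism comes from; the barrier between phases is the synchronization cost the abstract refers to.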