Comparison of OpenMP & OpenCL Parallel Processing Technologies

Thouti, Krishnahari; Sathe, S. R.

doi:10.14569/ijacsa.2012.030410

Cited by 12 publications

(7 citation statements)

References 2 publications


“…A vendor-specific compiler is employed to optimize OpenCL to the target architecture. Previous studies (such as [11,35]) have shown that implementations with OpenCL achieve very close performance to those with native languages such as CUDA and OpenMP on the GPU and the CPU, respectively.…”

Section: OpenCL (mentioning)

confidence: 95%

Revisiting co-processing for hash joins on the coupled CPU-GPU architecture

2013

Proc. VLDB Endow.

107


Query co-processing on graphics processors (GPUs) has become an effective means to improve the performance of main memory databases. However, the relatively low bandwidth and high latency of the PCI-e bus are usually bottleneck issues for co-processing. Recently, coupled CPU-GPU architectures have received a lot of attention, e.g. AMD APUs with the CPU and the GPU integrated into a single chip. That opens up new opportunities for optimizing query co-processing. In this paper, we experimentally revisit hash joins, one of the most important join algorithms for main memory databases, on a coupled CPU-GPU architecture. Particularly, we study the fine-grained co-processing mechanisms on hash joins with and without partitioning. The co-processing outlines an interesting design space. We extend existing cost models to automatically guide decisions on the design space. Our experimental results on a recent AMD APU show that (1) the coupled architecture enables fine-grained co-processing and cache reuses, which are inefficient on discrete CPU-GPU architectures; (2) the cost model can automatically guide the design and tuning knobs in the design space; (3) fine-grained co-processing achieves up to 53%, 35% and 28% performance improvement over CPU-only, GPU-only and conventional CPU-GPU co-processing, respectively. We believe that the insights and implications from this study are initial yet important for further research on query co-processing on coupled CPU-GPU architectures.



“…OpenCL programs can be coded once and run on any OpenCL-compatible devices. Existing studies [11,34] have shown that programs in OpenCL can achieve very close performance to those in platform-specific languages such as CUDA for NVIDIA GPUs and OpenMP for CPUs. For example, Fang et al [11] demonstrate that the CUDA-based implementations are at most 30% better than OpenCL-based implementations on NVIDIA GPUs.…”

Section: Unified Programming Interface (mentioning)

confidence: 99%

“…For example, Fang et al [11] demonstrate that the CUDA-based implementations are at most 30% better than OpenCL-based implementations on NVIDIA GPUs. On CPUs, OpenCL even outperforms OpenMP in many scenarios [34].…”

Section: Unified Programming Interface (mentioning)

confidence: 99%

In-cache query co-processing on coupled CPU-GPU architectures

Zhang

2014

Proc. VLDB Endow.


Recently, there have been some emerging processor designs in which the CPU and the GPU (Graphics Processing Unit) are integrated in a single chip and share the Last Level Cache (LLC). However, the main memory bandwidth of such coupled CPU-GPU architectures can be much lower than that of a discrete GPU. As a result, current GPU query co-processing paradigms can severely suffer from memory stalls. In this paper, we propose a novel in-cache query co-processing paradigm for main memory On-Line Analytical Processing (OLAP) databases on coupled CPU-GPU architectures. Specifically, we adapt CPU-assisted prefetching to minimize cache misses in GPU query co-processing and CPU-assisted decompression to improve query execution performance. Furthermore, we develop a cost model guided adaptation mechanism for distributing the workload of prefetching, decompression, and query execution between CPU and GPU. We implement a system prototype and evaluate it on two recent AMD APUs, A8 and A10. The experimental results show that 1) in-cache query co-processing can effectively improve the performance of the state-of-the-art GPU co-processing paradigm by up to 30% and 33% on A8 and A10, respectively, and 2) our workload distribution adaptation mechanism can significantly improve the query performance by up to 36% and 40% on A8 and A10, respectively.


“…A demanding need to increase the computational performance in science and engineering headed for heterogeneous computing and highly parallel architectures thus created a strong need for programmers to develop infrastructure in the form of libraries routine to support computing is heterogeneous hardware platforms [9]. Faster executions of public key cryptography and precisely RSA are currently of extreme importance.…”

Section: Figure 1: The Different Architecture, CPU vs. GPU [7] (mentioning)

confidence: 99%

Parallelizing RSA Algorithm on Multicore CPU and GPU

Fadhil, Younis

2014

IJCA


Public key algorithms are widely known to be slower than symmetric key alternatives in the area of cryptographic algorithms because of their basis in modular arithmetic. The most widely used public key algorithm is RSA. Therefore, how to enhance the speed of the RSA algorithm has been a significant research topic in computer security as well as in computing fields. With the remarkable increase in the computing capability of modern Graphics Processing Units (GPUs) as co-processors of the CPU, one can significantly benefit from the Single Instruction Multiple Thread (SIMT) style of computing. This paper proposes a hybrid system to parallelize RSA for multicore CPUs and many-core GPUs with variable key size. In doing so, three variant implementations of the RSA algorithm are developed to facilitate performance comparison against the Crypto++ library and a sequential counterpart. The GPU implementation gained a speedup factor of approximately 23 over the sequential CPU implementation, while the multithreaded CPU implementation gained a speedup factor of only 6 over the sequential CPU implementation as far as latency is concerned. Furthermore, additional speedup could be gained as far as throughput is concerned, due to overlapping of multithreaded operations whenever free resources are available: the throughput gained for 1024 bits is ~1800 msg/sec, and for 2048 bits ~250 msg/sec. The experiments are conducted on a laptop with an Intel Core i7-2670QM 2.20 GHz CPU and an Nvidia GeForce GT630M GPU. Results reveal that the GPU is appropriate for speeding up the RSA algorithm.

