GPL

Paul, Johns; He, Jiong; He, Bingsheng

doi:10.1145/2882903.2915224

Cited by 63 publications

(22 citation statements)

References 37 publications


“…The study found that column store layouts are uniquely suitable for GPUs due to its ability to 1) ensure more coalesced data access, 2) achieve better compression ratios and 3) minimize the amount of data that needs to be moved to the GPUs. Later studies [11,119] arrived at a similar conclusion and hence modern GPU database systems [4,9] have almost exclusively adopted column store layout.…”

Section: Data Storage and Data Access (mentioning)

confidence: 92%

“…Weaver [31] proposed a design where the final gather operation is not performed when the filter kernel is fused with other operator kernels. Later, GPL [4] adopted a similar pipelined design where the gather stage is skipped. Now, most early GPU filter implementations [9,44], made use of branching if statements while checking the filter criteria.…”

Section: Selection/filter (mentioning)

confidence: 99%

“…Following these studies, Paul et al. [4] demonstrated that a single relational operator is often unable to efficiently use all the hardware resources available on a single GPU. Taking advantage of the support for concurrent kernel execution, GPL improved the resource utilization of GPU hardware when processing relational …”

Section: Concurrency Control (mentioning)

confidence: 99%

“…This use of higher bandwidth global memory and their massively parallel architecture design is ideal for OLAP systems that require the parallel processing of a larger number of data entries. Due to these reasons, we have witnessed significant efforts in the past decade to enable the use of GPUs in high performance data analytics operations [3][4][5][6].…”

Section: Chapter 1 Introduction (mentioning)

confidence: 99%

“…More recent studies have focused on improving the query execution performance by minimizing the cost of moving the intermediate data between the relational operators/GPU kernels. To achieve this, techniques such as efficient pipelining of relational operators [4,30], dynamic fusion of GPU kernels [31] and just-in-time (JIT) code generation [3,5,6] have been explored.…”

Section: Chapter 1 Introduction (mentioning)

confidence: 99%


In-memory analytical query processing on GPUs

Paul


The high global memory bandwidth and the large number of parallel cores available in modern Graphics Processing Units (GPUs) make them ideal for high-performance Online Analytical Processing (OLAP) systems. However, it is challenging to design efficient high-performance query processing systems for GPUs for the following reasons: 1) the rapid evolution of GPU hardware in the last decade, 2) the significant differences in the hardware architecture of GPUs when compared to CPUs, 3) the high overhead of moving data between the CPU and GPU and 4) the small global memory size of a single GPU, which necessitates accessing remote data over PCIe or NVLink interconnects when processing large data sets. In this thesis, we study existing query processing systems for GPUs and propose techniques to improve query execution efficiency on GPUs.

We begin by studying the performance of hash join, which is one of the most compute- and memory-intensive relational operators in OLAP systems. Specifically, we first revisit the major GPU hash join implementations of the past decade that were designed to execute on a single GPU. We then detail how these implementations take advantage of different GPU architecture features and conduct a comprehensive evaluation of their performance and cost-efficiency using different generations of GPUs. This helps shed light on the impact of different GPU architecture features on the performance of the hash join operation and identify the factors guiding the choice of these features. We further study how data characteristics like skew and match rate impact the performance of GPU hash join implementations. As part of this study, we also propose novel techniques to improve the performance of the hash join operation when joining input relations with severe skew or high match rate. Our evaluation finds that the proposed techniques help avoid any performance degradation when joining skewed input relations and achieve up to 2.5x better performance when joining input relations with high match rate.

Next, we extend our study of the hash join operation to modern multi-GPU architectures. The recent scale-up of GPU hardware through the integration of multiple GPUs into a single machine and the introduction of higher bandwidth interconnects like NVLink 2.0 has enabled new opportunities for relational query processing on multiple GPUs. However, due to the unique characteristics of GPUs and the interconnects, existing hash join implementations spend up to 66% of their execution time moving data between the GPUs and achieve lower than 50% utilization of the newer high bandwidth interconnects. This leads to extremely poor scalability of hash join performance on multiple GPUs, which can be slower than the performance on a single GPU. In this thesis, we propose MG-Join, a scalable partitioned hash join implementation on multiple GPUs of a single machine. In order to effectively improve the bandwidth utilization, we develop a novel multi-hop routing for cross-GPU communication that adaptively chooses...


Overtaking CPU DBMSes with a GPU in Whole-Query Analytic Processing with Parallelism-Friendly Execution Plan Optimization

Agbaria, Minor, Peterfreund, et al. 2017

Lecture Notes in Computer Science


Abstract. Existing work on accelerating analytic DB query processing with (discrete) GPUs fails to fully realize their potential for speedup through parallelism: published results do not achieve significant speedup over more performant CPU-only DBMSes when processing complete queries. This paper presents a successful effort to better meet this challenge, in the form of a proof-of-concept query processing framework. The framework constitutes a graft onto an existing DBMS, altering some parts of it and replacing its execution engine entirely. It intensively refactors query execution plans, making them better parallelizable, before executing them on either a CPU or a GPU. This results in a significant speedup even on a CPU, and a further speedup when using a GPU, over the chosen host DBMS (MonetDB), which itself already bests most published results utilizing a GPU for query processing. Finally, we outline some concrete future improvements on our results which can cut processing time by half and possibly much more.
