2020
DOI: 10.14778/3380750.3380758

Data-parallel query processing on non-uniform data

Abstract: Graphics processing units (GPUs) promise spectacular performance advantages when used as database coprocessors. Their massive compute capacity, however, is often hampered by control flow divergence caused by non-uniform data distributions. When data-parallel work items demand different amounts or types of processing, instructions execute with lowered efficiency. Query compilation techniques---a recent advance in GPU-accelerated database processing---suffer fr…

Cited by 22 publications (9 citation statements); references 24 publications.
“…Sitaridi et al [157] proposed the splitting of string matching into multiple steps to minimize divergence. More recently, DogQC [6] detailed the high cost of importing data using dictionary encoding for strings and instead proposed the use of techniques like push-down parallelism and lane refill to minimize the divergence encountered by GPU threads when processing variable-length string data.…”
Section: Selection/filter
confidence: 99%
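The lane-refill idea mentioned in this statement can be illustrated with a minimal Python simulation: survivors of a divergent filter are buffered until a full "warp" of work exists, so the expensive downstream stage always runs with all lanes busy. The names (`WARP_SIZE`, `lane_refill`, `predicate`, `process`) are illustrative, not DogQC's actual API, and real warps hold 32 lanes.

```python
WARP_SIZE = 8  # illustrative; real GPU warps have 32 lanes

def lane_refill(items, predicate, process):
    """Simulate lane refill: instead of running `process` on a warp in
    which most lanes were deactivated by the filter, buffer the
    survivors and execute `process` only with fully occupied warps."""
    buffer = []   # survivors waiting for a full warp
    results = []
    # consume the input in warp-sized chunks
    for i in range(0, len(items), WARP_SIZE):
        warp = items[i:i + WARP_SIZE]
        buffer.extend(x for x in warp if predicate(x))  # divergent filter
        while len(buffer) >= WARP_SIZE:                 # enough to refill all lanes
            full_warp, buffer = buffer[:WARP_SIZE], buffer[WARP_SIZE:]
            results.extend(process(x) for x in full_warp)  # all lanes active
    # drain the remainder in one final, underfull warp
    results.extend(process(x) for x in buffer)
    return results

# selective predicate: without refill, most lanes would idle downstream
out = lane_refill(list(range(100)), lambda x: x % 7 == 0, lambda x: 2 * x)
```

The buffering changes when the downstream work executes, not its result, which is why the technique preserves query semantics while raising lane occupancy.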
“…The new generation of compiled database systems for GPUs have helped minimize the synchronization overhead encountered by pipelined systems [3,5,6]. This is because multiple operators are co-located within the same kernel for compiled systems, allowing them to synchronize between operators using lower overhead thread-block level synchronization intrinsics for operators within the same pipeline.…”
Section: Concurrency Control
confidence: 99%
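The co-location of operators that this statement describes can be sketched with a toy Python model: filter, projection, and partial aggregation live in one fused "kernel" function, so adjacent operators hand results to each other directly rather than across kernel launches. On a real GPU the hand-off would go through shared memory behind a cheap block-level barrier (e.g. `__syncthreads`); the function and data names here are hypothetical.

```python
data = list(range(32))

def fused_pipeline_kernel(block):
    """Toy model of a compiled, fused pipeline: three operators share
    one 'kernel'. A real GPU system would synchronize between them with
    a block-level barrier instead of relaunching a kernel per operator."""
    selected = [x for x in block if x % 2 == 0]  # operator 1: filter
    doubled = [2 * x for x in selected]          # operator 2: projection
    return sum(doubled)                          # operator 3: partial aggregate

# each 8-element chunk plays the role of one thread block
total = sum(fused_pipeline_kernel(data[i:i + 8])
            for i in range(0, len(data), 8))
```

Only the final combination of per-block partial aggregates needs a global (cross-kernel) synchronization point, which is the overhead saving the citing paper refers to.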