To Partition, or Not to Partition, That is the Join Question in a Real System

Bandle, Maximilian; Giceva, Jana; Neumann, Thomas

doi:10.1145/3448016.3452831

Cited by 25 publications

(7 citation statements)

References 40 publications


“…In Fig. 23a, we place a preaggregation at 3, where it calculates the initial aggregates, while Γ merges and finalizes them. Figure 23b shows an additional preaggregation placed in between at 2, which merges the aggregates from its input.…”

Section: Finalization Computing An Expression On Aggregations (mentioning)

confidence: 99%


Practical planning and execution of groupjoin and nested aggregates

2022


Groupjoins combine execution of a join and a subsequent group-by. They are common in analytical queries and occur in about 1/8 of the queries in TPC-H and TPC-DS. While they were originally invented to improve performance, efficient parallel execution of groupjoins can be limited by contention in many-core systems. Efficient implementations of groupjoins are highly desirable, as groupjoins are not only used to fuse group-by and join, but are also useful to efficiently execute nested aggregates. For these, the query optimizer needs to reason over the result of aggregation to optimally schedule it. Traditional systems quickly reach their limits of selectivity and cardinality estimations over computed columns and often treat group-by as an optimization barrier. In this paper, we present techniques to efficiently estimate, plan, and execute groupjoins and nested aggregates. We propose four novel techniques: aggregate estimates to predict the result distributions of aggregates, parallel groupjoin execution for scalable execution of groupjoins, index groupjoins, and a greedy eager aggregation optimization technique that introduces nested preaggregations to significantly improve execution plans. The resulting system has improved estimates, better execution plans, and a contention-free evaluation of groupjoins, which speeds up TPC-H and TPC-DS queries significantly.


“…Re-using hash partitions, and even whole hash tables is a well-known optimization [18,36]. One often discussed question is, if hash tables should be partitioned or nonpartitioned [3]. Our proposed approaches in Sect.…”

Section: Related Work (mentioning)

confidence: 99%

Practical planning and execution of groupjoin and nested aggregates

2022


“…Consequently, there is a large body of related work that optimizes hash joins [39,49,55] and hash aggregations [38,52,61]. One often discussed question is, if hash tables should be partitioned or non-partitioned [3]. Our proposed approaches in Section 3 try to use a non-partitioned hash table to avoid materializing data, while using thread-local partitioning for heavy-hitters.…”

Section: Related Work (mentioning)

confidence: 99%

“…Order Benchmark (JOB)3 : Since IMDb primarily stores facts as strings, we extract a separate table that contains the vote count and the user rating for movies, to allow statistics collection. On these columns, we define five additional aggregation queries that calculate statistics on the new numerical columns.…”

mentioning

confidence: 99%

A practical approach to groupjoin and nested aggregates

Fent

Neumann

2021

Proc. VLDB Endow.

Self Cite


Groupjoins, the combined execution of a join and a subsequent group by, are common in analytical queries, and occur in about 1/8 of the queries in TPC-H and TPC-DS. While they were originally invented to improve performance, efficient parallel execution of groupjoins can be limited by contention, which limits their usefulness in a many-core system. Having an efficient implementation of groupjoins is highly desirable, as groupjoins are not only used to fuse group by and join but are also introduced by the unnesting component of the query optimizer to avoid nested-loops evaluation of aggregates. Furthermore, the query optimizer needs to be able to reason over the result of aggregation in order to schedule it correctly. Traditional selectivity and cardinality estimations quickly reach their limits when faced with computed columns from nested aggregates, which leads to poor cost estimations and thus suboptimal query plans. In this paper, we present techniques to efficiently estimate, plan, and execute groupjoins and nested aggregates. We propose two novel techniques: aggregate estimates to predict the result distribution of aggregates, and parallel groupjoin execution for a scalable execution of groupjoins. The resulting system has significantly better estimates and a contention-free evaluation of groupjoins, which can speed up some TPC-H queries up to a factor of 2.


“…This is an important problem because it occurs in every large database. The ideal solution would allow, despite the increase of the number of records in the tables, to perform operations on the database as quickly as at the time of its implementation [2].…”

Section: Introduction (mentioning)

confidence: 99%

Efficiently Processing Data in Table With Billions of Records

Bednarczuk

Borsuk

2022

IAPGOS


Over time, systems connected to databases slow down. This is usually due to the increase in the amount of data stored in individual tables, counted even in the billions of records. Nevertheless, there are methods for making the speed of the system independent of the number of records in the database. One of these ways is table partitioning. When used correctly, the solution can ensure efficient operation of very large databases even after several years. However, not everything is predictable, because some undesirable phenomena become apparent only with a very large amount of data. The article presents a study of the execution time of the same queries with an increasing number of records in a table. These studies reveal and present the timing and circumstances of the anomaly for a certain number of records.

