Partitioning and Bucketing Techniques to Speed up Query Processing in Spark-SQL

Ramdane, Yassine; Boussaïd, Omar; Kabachi, Nadia; Bentayeb, Fadila

doi:10.1109/padsw.2018.8644891

Cited by 8 publications

(3 citation statements)

References 5 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…A technique established on frequent itemset mining was proposed in [16] towards partition, bucket, and sort tables (PBSTs) into a big DW with the more common predicate attributes in the workload. This method considered the quantity of the relation attributes, data skew, and the physical features of the cluster nodes.…”

Section: State Of the Art Of Horizontal Fragmentation Methodsmentioning

confidence: 99%

Decision-Tree-Based Horizontal Fragmentation Method for Data Warehouses

et al. 2022

View full text Add to dashboard Cite

Data warehousing gives frameworks and means for enterprise administrators to methodically prepare, comprehend, and utilize the data to improve strategic decision-making skills. One of the principal challenges to data warehouse designers is fragmentation. Currently, several fragmentation approaches for data warehouses have been developed since this technique can decrease the OLAP (online analytical processing) query response time and it provides considerable benefits in table loading and maintenance tasks. In this paper, a horizontal fragmentation method, called FTree, that uses decision trees to fragment data warehouses is presented to take advantage of the effectiveness that this technique provides in classification. FTree determines the OLAP queries with major relevance, evaluates the predicates found in the workload, and according to this, builds the decision tree to select the horizontal fragmentation scheme. To verify that the design is correct, the SSB (star schema benchmark) was used in the first instance; later, a tourist data warehouse was built, and the fragmentation method was tested on it. The results of the experiments proved the efficacy of the method.

show abstract

Section: State Of the Art Of Horizontal Fragmentation Methodsmentioning

confidence: 99%

Decision-Tree-Based Horizontal Fragmentation Method for Data Warehouses

et al. 2022

View full text Add to dashboard Cite

show abstract

“…We select the 𝑇 𝑃𝐶 − 𝐷𝑆 [59], 𝑇 𝑃𝐶 − 𝐻 [12], and three programs from 𝐻𝑖𝐵𝑒𝑛𝑐ℎ [31] as representative programs to evaluate LOCAT, as shown in Table 1. 𝑇 𝑃𝐶 − 𝐷𝑆, containing 104 queries, has been widely used in Spark SQL systems for research and development of optimization techniques [18,32,47]. It models complex decision support functions to provide highly comparable, controlled, and repeatable tasks in evaluating the performance of Spark SQL systems [7].…”

Section: Representative Programsmentioning

confidence: 99%

LOCAT: Low-Overhead Online Configuration Auto-Tuning of Spark SQL Applications

Xin¹,

Hwang²,

Yu³

2022

Preprint

View full text Add to dashboard Cite

Spark SQL has been widely deployed in industry but it is challenging to tune its performance. Recent studies try to employ machine learning (ML) to solve this problem. They however suffer from two drawbacks. First, it takes a long time (high overhead) to collect training samples. Second, the optimal configuration for one input data size of the same application might not be optimal for others.To address these issues, we propose a novel Bayesian Optimization (BO) based approach named LOCAT to automatically tune the configurations of Spark SQL applications online. LOCAT innovates three techniques. The first technique, named QCSA, eliminates the configuration-insensitive queries by Query Configuration Sensitivity Analysis (QCSA) when collecting training samples. The second technique, dubbed DAGP, is a Datasize-Aware Gaussian Process (DAGP) which models the performance of an application as a distribution of functions of configuration parameters as well as input data size. The third technique, called IICP, Identifies Important Configuration Parameters (IICP) with respect to performance and only tunes the important parameters. As such, LOCAT can tune the configurations of a Spark SQL application with low overhead and adapt to different input data sizes.We employ Spark SQL applications from benchmark suites𝑇 𝑃𝐶− 𝐷𝑆, 𝑇 𝑃𝐶 − 𝐻 , and 𝐻𝑖𝐵𝑒𝑛𝑐ℎ running on two significantly different clusters, a four-node ARM cluster and an eight-node x86 cluster, to evaluate LOCAT. The experimental results on the ARM cluster show that LOCAT accelerates the optimization procedures of Tuneful [22], DAC [66], GBO-RL [36], and QTune [37] by factors of 6.4×, 7.0×, 4.1×, and 9.7× on average, respectively. On the x86 cluster, LOCAT reduces the optimization time of Tuneful, DAC, GBO-RL, and QTune by factors of 6.4×, 6.3×, 4.0×, and 9.2× on average, respectively. Moreover, LOCAT improves the performance of the applications on

show abstract

“…In [28] the authors propose a technique based on frequent itemsets mining, to Partition, Bucket and Sort the Tables of a big data warehouse with the most frequent predicate attributes in the queries: they apply data mining algorithms on a queries workload to determine the most frequent predicate attributes, and use a hash-partitioning technique without making any assumptions about the filters used in the query predicates (i.e. in the Where clause of a SQL query).…”

Section: Related Workmentioning

confidence: 99%

CoPart: a context-based partitioning technique for big data

Migliorini

Belussi²,

Quintarelli³

et al. 2021

J Big Data

View full text Add to dashboard Cite

The MapReduce programming paradigm is frequently used in order to process and analyse a huge amount of data. This paradigm relies on the ability to apply the same operation in parallel on independent chunks of data. The consequence is that the overall performances greatly depend on the way data are partitioned among the various computation nodes. The default partitioning technique, provided by systems like Hadoop or Spark, basically performs a random subdivision of the input records, without considering the nature and correlation between them. Even if such approach can be appropriate in the simplest case where all the input records have to be always analyzed, it becomes a limit for sophisticated analyses, in which correlations between records can be exploited to preliminarily prune unnecessary computations. In this paper we design a context-based multi-dimensional partitioning technique, called CoPart, which takes care of data correlation in order to determine how records are subdivided between splits (i.e., units of work assigned to a computation node). More specifically, it considers not only the correlation of data w.r.t. contextual attributes, but also the distribution of each contextual dimension in the dataset. We experimentally compare our approach with existing ones, considering both quality criteria and the query execution times.

show abstract

Partitioning and Bucketing Techniques to Speed up Query Processing in Spark-SQL

Cited by 8 publications

References 5 publications

Decision-Tree-Based Horizontal Fragmentation Method for Data Warehouses

Decision-Tree-Based Horizontal Fragmentation Method for Data Warehouses

LOCAT: Low-Overhead Online Configuration Auto-Tuning of Spark SQL Applications

CoPart: a context-based partitioning technique for big data

Contact Info

Product

Resources

About