In the past few years, executing high-concurrency queries with interactive SQL query engines on Hadoop has become an important activity for many organizations. However, these systems do not adopt Multi-Query Optimization (MQO) to accelerate query processing, for two main reasons. Firstly, traditional MQO research assumes that the queries being optimized together are highly similar, whereas these systems usually serve a variety of applications. Although queries from the same application have high similarity, queries from different applications may have low similarity, so applying traditional MQO to all of them is inefficient and time-consuming. Secondly, integrating MQO may require extensive modifications to the system. To integrate MQO into interactive SQL query engines on Hadoop efficiently, we propose a query grouping-based MQO framework. A lightweight mechanism represents SQL queries, and a grouping method built on this representation speeds up the optimization process. A cost model is integrated to estimate the execution cost of interactive SQL query engines on Hadoop. Using the proposed framework, we modify the Impala system to support MQO, and experimental results on TPC-DS show significant performance improvements.
KEYWORDS
grouping method, Impala system, multi-query optimization
INTRODUCTION
Large-scale analytical data processing has become commonplace in many enterprises and research groups, and many interactive SQL query engines on Hadoop are designed for this scenario. This style of query processing over large-scale analytical data was introduced by the Google Dremel system [1]. Since then, several such systems have appeared, e.g., Apache Impala [2], Apache Hawq [3], Apache Drill [4], and Facebook Presto. These systems usually support a wide range of applications concurrently and have a large throughput.
Multi-Query Optimization (MQO) [5] is a well-known database research problem that has been studied by the database community since the 1980s. The central problem of MQO is to take full advantage of common tasks by constructing a global plan. To solve it, many algorithms have been proposed, including heuristic algorithms [6-8] and some improved algorithms [9][10][11]. There are two alternative system architectures for integrating MQO. One reuses the local optimizer, which can only optimize a single query at a time, and then merges the optimal plans of all queries together. The other modifies the optimizer to process a set of queries together and generate the optimal plan for all queries directly. However, for this second architecture an MQO module may need to be developed from scratch [5].
As far as we know, none of the interactive SQL query engines on Hadoop integrates these classical MQO algorithms for performance improvement. There are two main reasons. Firstly, traditional MQO algorithms assume that multiple queries have high similarity. However, these systems usually serve a variety of applications. Although queries from the same application have high similarity, queries from differ...