Abstract. Queries defined on data warehouses are complex and involve several join operations that incur an expensive computational cost. This cost becomes even more prohibitive when queries access very large volumes of data. To improve response time, data warehouse administrators generally use indexing techniques such as star-join indexes or bitmap join indexes. This task is nevertheless complex and tedious. Our solution lies in the field of data warehouse auto-administration. In this framework, we propose an automatic index selection strategy. We exploit a data mining technique, more precisely frequent itemset mining, to determine a set of candidate indexes from a given workload. Then, we propose several cost models that allow us to build an index configuration composed of the indexes providing the best profit. These models evaluate the cost of accessing data through bitmap join indexes, as well as the cost of updating and storing these indexes.
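To make the candidate-selection step concrete, here is a minimal sketch of frequent itemset mining over a query workload. Each transaction is the set of dimension attributes referenced by one query; the attribute names, the workload, and the 0.5 support threshold are all hypothetical, and the abstract's cost models (index access, update, and storage) are only indicated by a comment, not implemented.

```python
from itertools import combinations
from collections import Counter

# Hypothetical workload: each "transaction" is the set of dimension
# attributes referenced by one star-join query.
workload = [
    {"customer.city", "product.category", "time.year"},
    {"customer.city", "time.year"},
    {"product.category", "time.year"},
    {"customer.city", "product.category", "time.year"},
]

def frequent_itemsets(transactions, min_support):
    """Naive Apriori-style enumeration: count every attribute combination
    appearing in a transaction and keep those meeting the support threshold."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        for size in range(1, len(t) + 1):
            for combo in combinations(sorted(t), size):
                counts[combo] += 1
    return {combo: c / n for combo, c in counts.items() if c / n >= min_support}

# Each frequent attribute set is a candidate bitmap join index; a cost model
# would then rank candidates by expected query savings vs. storage/update cost.
for itemset, support in sorted(frequent_itemsets(workload, 0.5).items()):
    print(itemset, round(support, 2))
```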
Abstract. Benchmarking data warehouses is a means to evaluate the performance of systems and the impact of different technical choices. Existing benchmarks were developed on the relational model, which has long been the most widely used to support classical data warehousing applications; a notable example is the Star Schema Benchmark (SSB), designed to measure the performance of database products when executing star-schema queries. As the volume of data keeps growing, the types of data generated by applications become richer than before. As a result, traditional relational databases are challenged to manage big data. Many IT companies attempt to address big data challenges using a NoSQL (Not only SQL) database, possibly on top of a distributed computing system. NoSQL databases are known to be non-relational, horizontally scalable, and distributed. In this paper, we present a new benchmark for columnar NoSQL data warehouses, namely CNSSB (Columnar NoSQL Star Schema Benchmark). CNSSB is derived from SSB and generates synthetic data and a query set to evaluate column-oriented NoSQL data warehouses. We have implemented CNSSB under the HBase column-oriented database management system (DBMS) and applied its query workload to compare the performance of two SQL skins, Phoenix and HQL (Hive Query Language). This allowed us to observe better performance with Phoenix than with HQL.
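As an illustration of the kind of comparison the abstract describes, the sketch below times one SSB-style star-join query (modeled on SSB Q1.1) through the two SQL skins, using the phoenixdb and PyHive client libraries. The table and column names follow the SSB schema, but the exact CNSSB DDL, the `dates` table name, and the connection endpoints are assumptions to adjust for a real cluster; this is not the paper's measurement protocol.

```python
import time
import phoenixdb           # client for the Phoenix query server
from pyhive import hive    # client for HiveServer2

# SSB Q1.1-style star-join query; schema names are assumed here.
QUERY = """
SELECT SUM(lo_extendedprice * lo_discount) AS revenue
FROM lineorder
JOIN dates ON lo_orderdate = d_datekey
WHERE d_year = 1993 AND lo_discount BETWEEN 1 AND 3 AND lo_quantity < 25
"""

def timed_run(cursor, sql):
    """Execute one query and return its wall-clock latency in seconds."""
    start = time.perf_counter()
    cursor.execute(sql)
    cursor.fetchall()
    return time.perf_counter() - start

# Hypothetical endpoints; replace with your Phoenix query server and
# HiveServer2 addresses.
phoenix_cur = phoenixdb.connect("http://localhost:8765/", autocommit=True).cursor()
hive_cur = hive.connect(host="localhost", port=10000).cursor()

print("Phoenix:", timed_run(phoenix_cur, QUERY))
print("HQL:    ", timed_run(hive_cur, QUERY))
```

In practice one would run each query several times and report averages rather than a single measurement.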
Abstract. In this position paper, we draw up the specifications for a novel benchmark for comparing parallel processing frameworks in the context of big data applications hosted in the cloud. We aim at filling several gaps in existing cloud data processing benchmarks, which lack a real-life context for their processes and thus lose relevance when trying to assess performance for real applications. Hence, we propose a fictitious news site hosted in the cloud that is to be managed by the framework under analysis, together with several objective use-case scenarios and measures for evaluating system performance. The main strengths of our benchmark definition are parallelization capabilities supporting cloud features and big data properties.