Keywords: graph queries, relational algebra, query optimization
Introduction

The key characteristics of Big Data are often defined as the variety, velocity and volume of data [28]. Applications operating on continuously changing graphs are a prime example: the semi-structured, graph-like nature of the data introduces high variety, changes happen at high velocity, and datasets are often high-volume. Such applications include fraud detection in financial transactions [27], validation of engineering models [3], and static analysis of source code repositories [35]. These use cases provide a set of complex queries that need to be evaluated continuously on each change of the underlying graph.

Traditional approaches need to reevaluate each query upon each change, which often takes minutes on a large dataset. In contrast, incremental query evaluation caches interim results, so it only requires reevaluation on the small fragment of the dataset impacted by the change. This leads to a significant speedup for large and continuously changing data. Although several approaches exist for incremental query evaluation in the context of expert systems [9,20], incremental query evaluation is not in widespread use in graph databases.

In order to predict query performance at runtime, relational databases synthesize and evaluate different query plans, each of which imposes a certain ordering on the relational algebraic operations prescribed by the query. Optimizing query plans is a challenging task, since even simple queries may admit a wide variety of query plans with different costs. Database engines use heuristics-based optimization techniques and evaluate a cost function for the different query plans [10].

Query plans have also been adapted for graph query engines that use a local-search-based query evaluation strategy, where such a plan is called a search plan.
Optimization techniques may exploit the type and multiplicity information defined in the graph schema (or metamodel) [29,22] or rely upon runtime statistics of the instance graph [11,38,39].

In the case of incremental graph query engines, the structure and the content of caches have the most significant impact on query performance. Therefore, optimization aims to reduce the execution time and memory consumption imposed by a complex network of caches [37].
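
To make the contrast between full reevaluation and incremental evaluation concrete, the following minimal sketch maintains the cached result of a single two-hop path query, (a)->(b)->(c), under edge insertions. It is an illustration under stated assumptions, not the design of any particular engine: the class `TwoHopView` and its methods are hypothetical names, deletions are omitted, and a real incremental engine would maintain a network of such caches rather than one.

```python
from collections import defaultdict


class TwoHopView:
    """Hypothetical cache for the query (a)->(b)->(c), kept up to date on inserts."""

    def __init__(self):
        self.out_edges = defaultdict(set)  # source node -> set of target nodes
        self.in_edges = defaultdict(set)   # target node -> set of source nodes
        self.matches = set()               # cached (a, b, c) result tuples

    def insert_edge(self, src, dst):
        """Add the edge src->dst and return only the matches it creates."""
        if dst in self.out_edges[src]:
            return set()  # edge already present, result unchanged
        self.out_edges[src].add(dst)
        self.in_edges[dst].add(src)
        delta = set()
        # The new edge acts as the first hop: (src)->(dst)->(c).
        for c in self.out_edges[dst]:
            delta.add((src, dst, c))
        # The new edge acts as the second hop: (a)->(src)->(dst).
        for a in self.in_edges[src]:
            delta.add((a, src, dst))
        self.matches |= delta
        return delta

    def recompute(self):
        """Full reevaluation over all edges, shown only for comparison."""
        return {
            (a, b, c)
            for a, targets in self.out_edges.items()
            for b in targets
            for c in self.out_edges.get(b, ())
        }


view = TwoHopView()
for edge in [(1, 2), (2, 3), (3, 4), (0, 1)]:
    print(edge, "adds", sorted(view.insert_edge(*edge)))
```

Each insertion touches only the neighbours of the two endpoints of the new edge, whereas `recompute` scans every edge in the graph. The price is the memory held by `matches` and the two adjacency indexes, which is the trade-off between execution time and memory consumption discussed above.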