A well-established technique for capturing database provenance as annotations on data is to instrument queries to propagate such annotations. However, even sophisticated query optimizers often fail to produce efficient execution plans for instrumented queries. We develop provenance-aware optimization techniques to address this problem. Specifically, we study algebraic equivalences targeted at instrumented queries and alternative ways of instrumenting queries for provenance capture. Furthermore, we present an extensible heuristic and cost-based optimization framework utilizing these optimizations. Our experiments confirm that these optimizations are highly effective, improving performance by several orders of magnitude for diverse provenance tasks.
INTRODUCTION

Database provenance, information about the origin of data and the queries and/or updates that produced it, is critical for debugging queries, auditing, establishing trust in data, and many other use cases. The de facto standard for database provenance [1], [2] is to model provenance as annotations on data and define a query semantics that determines how annotations propagate. Under such a semantics, each output tuple t of a query Q is annotated with its provenance, i.e., a combination of input tuple annotations that explains how these inputs were used by Q to derive t.

Database provenance systems such as Perm [3], GProM [4], DBNotes [5], LogicBlox [2], declarative Datalog debugging [6], ExSPAN [7], and many others use a relational encoding of provenance annotations. These systems typically compile queries with annotated semantics into relational queries that produce this encoding of provenance annotations following the process outlined in Fig. 23a. We refer to this reduction from annotated to standard relational semantics as provenance instrumentation or instrumentation for short. The example below introduces a relational encoding of provenance polynomials [1] and the instrumentation approach for this model implemented in Perm [3].

Example 1. Consider a query over the database in Fig. 1 returning shops that sell items which cost more than $20:

The query's result is shown in Fig. 1d. Using provenance

The instrumentation we are using here is defined for any SPJ (Select-Project-Join) query (and beyond) based on a set of algebraic rewrite rules (see [3] for details).

The present paper extends [8]. Additional details are presented in the appendix.
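To make the instrumentation idea from Example 1 more concrete, the following is a minimal sketch of what an instrumented SPJ query might look like. Everything in it is an assumption made for illustration: the schema (`shop`, `sale`, `item`), the rows, the `prov_` attribute naming, and the helper `run_query` are hypothetical and do not reproduce the contents of Fig. 1 or the exact rewrite rules of Perm [3]. The sketch only shows the general pattern of a relational encoding, where each result tuple is paired with the input tuples that contributed to one of its derivations.

```python
import sqlite3

# Hypothetical schema and rows, chosen for illustration only.
# They are NOT the contents of Fig. 1.
SETUP = """
CREATE TABLE shop (name TEXT, numEmpl INTEGER);
CREATE TABLE sale (sName TEXT, itemId INTEGER);
CREATE TABLE item (id INTEGER, price INTEGER);
INSERT INTO shop VALUES ('Merdies', 3), ('Joba', 14);
INSERT INTO sale VALUES ('Merdies', 1), ('Merdies', 2),
                        ('Merdies', 3), ('Joba', 3);
INSERT INTO item VALUES (1, 100), (2, 10), (3, 25);
"""

# Shops that sell at least one item costing more than $20.
ORIGINAL = """
SELECT DISTINCT name
FROM shop, sale, item
WHERE name = sName AND itemId = id AND price > 20
ORDER BY name
"""

# Assumed instrumented variant (not the literal output of any system):
# duplicate elimination is dropped and every result row carries renamed
# copies of the input tuples it was derived from, one row per derivation.
INSTRUMENTED = """
SELECT name,
       name   AS prov_shop_name,  numEmpl AS prov_shop_numEmpl,
       sName  AS prov_sale_sName, itemId  AS prov_sale_itemId,
       id     AS prov_item_id,    price   AS prov_item_price
FROM shop, sale, item
WHERE name = sName AND itemId = id AND price > 20
ORDER BY name, id
"""


def run_query(sql):
    """Run `sql` against a fresh in-memory copy of the toy database."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(SETUP)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(run_query(ORIGINAL))
    for row in run_query(INSTRUMENTED):
        print(row)
```

In this toy data, a shop that sells two qualifying items appears in two rows of the instrumented result, one per derivation, while the original query returns it once. The instrumented query is an ordinary relational query, which is why its performance depends on how well a standard optimizer handles the extra attributes and the changed duplicate handling.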
Instrumentation Pipelines

In this work, we focus on optimizing instrumentation pipelines such as the one from Example 1. These pipelines divide the compilation of a frontend language to a target language into multiple compilation steps using one or more intermediate languages. We now introduce a subset of the pipelines supported by our approach to illustrate the breadth of applications supported by instrumentation. Our approach can be applied to any data management task that can be expressed as instrumentation. Notably, our implementation already supports additional pipelines, e.g., for summarizing provenance and managing ...