Massively parallel join algorithms have received much attention in recent years, but most prior work has focused on worst-case optimal algorithms. However, the worst-case optimality of these join algorithms relies on hard instances with very large output sizes, which rarely appear in practice. A stronger notion of optimality is output-optimality, which requires an algorithm to be optimal within the class of all instances sharing the same input and output size. An even stronger notion is instance-optimality, i.e., the algorithm is optimal on every single instance, but this may not always be achievable. In the traditional RAM model of computation, the classical Yannakakis algorithm is instance-optimal on any acyclic join. But in the massively parallel computation (MPC) model, the situation becomes much more complicated. We first show that for the class of r-hierarchical joins, instance-optimality can still be achieved in the MPC model. Then, we give a new MPC algorithm for an arbitrary acyclic join with load O(IN/p + √(IN·OUT)/p), where IN and OUT are the input and output sizes of the join, and p is the number of servers in the MPC model. This improves the MPC version of the Yannakakis algorithm by an O(√(OUT/IN)) factor. Furthermore, we show that this is output-optimal when OUT = O(p · IN), for every acyclic but non-r-hierarchical join. Finally, we give the first output-sensitive lower bound for the triangle join in the MPC model, showing that it is inherently more difficult than acyclic joins.

In the MPC model, the computation proceeds in rounds: in each round, each server sends messages to other servers, receives messages from other servers, and then does some local computation. The complexity of the algorithm is measured by the number of rounds and the load, denoted as L, which is the maximum message size received by any server in any round. Without any constraint on the load, all problems can be solved trivially in one round by simply sending all data to one server.
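To make the improvement over the MPC version of the Yannakakis algorithm concrete, the following sketch compares the two load bounds numerically (illustrative only: the instance sizes IN, OUT and server count p are hypothetical, and the constants hidden in the O-notation are dropped):

```python
import math

def yannakakis_load(IN, OUT, p):
    # MPC version of the Yannakakis algorithm: load O(IN/p + OUT/p)
    return IN / p + OUT / p

def new_algorithm_load(IN, OUT, p):
    # The new acyclic-join algorithm: load O(IN/p + sqrt(IN*OUT)/p)
    return IN / p + math.sqrt(IN * OUT) / p

# Hypothetical instance: a join with a much larger output than input
IN, OUT, p = 10**6, 10**9, 100

print(yannakakis_load(IN, OUT, p))      # dominated by the OUT/p term
print(new_algorithm_load(IN, OUT, p))   # dominated by sqrt(IN*OUT)/p

# When OUT >> IN, the ratio of the two bounds approaches sqrt(OUT/IN)
print(yannakakis_load(IN, OUT, p) / new_algorithm_load(IN, OUT, p))
```

On this instance √(OUT/IN) ≈ 31.6, so the dominant term of the load shrinks by roughly that factor, matching the O(√(OUT/IN)) improvement stated above.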
Initial efforts were mostly spent on what can be done in a single round of computation [3,7,8,24,26], but recently, more interest has been given to multi-round (but still constant-round) algorithms [2,22,24], since new main-memory-based systems, such as Spark and Flink, have much lower overhead per round than previous generations like Hadoop.

The MPC model can be considered a simplified version of the BSP model [32], but it has enjoyed more popularity in recent years. This is mostly because the BSP model takes too many measures into consideration, such as communication cost, local computation time, memory consumption, etc. The MPC model unifies all these costs in one parameter L, which makes the model much simpler. Meanwhile, although L is defined as the maximum incoming message size of a server, it is also closely related to the local computation time and memory consumption, which are both increasing functions of L. Thus, L serves as a good surrogate for these other cost measures. This is also why the MPC model does not limit the outgoing message size of a server, which is less relevant to the other costs.

All our algorithms work under the mild assumption IN ≥ p^{1+ε}, where ε > 0 is any small constant.
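The definition of the load L can be illustrated with a toy computation (a minimal sketch; the servers, rounds, and message sizes below are hypothetical and stand in for an actual MPC execution):

```python
# Toy illustration of the load L in the MPC model: L is the maximum
# total message size received by any single server in any single round.
# Each round is given as a list of (sender, receiver, size) triples.

def load(rounds, p):
    L = 0
    for messages in rounds:
        received = [0] * p                 # incoming bytes per server this round
        for sender, receiver, size in messages:
            received[receiver] += size
        L = max(L, max(received))          # worst server in the worst round
    return L

# Two hypothetical rounds on p = 3 servers
rounds = [
    [(0, 1, 5), (2, 1, 7), (0, 2, 3)],    # server 1 receives 5 + 7 = 12
    [(1, 0, 4), (2, 0, 4)],               # server 0 receives 4 + 4 = 8
]
print(load(rounds, p=3))                  # -> 12
```

Note that only incoming traffic is counted: consistent with the discussion above, a server may send arbitrarily large messages without affecting L.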