A Layered Aggregate Engine for Analytics Workloads

Schleich, Maximilian; Olteanu, Dan; Khamis, Mahmoud Abo; Ngo, Hung Q.; Nguyen, XuanLong

doi:10.1145/3299869.3324961

Cited by 52 publications

(57 citation statements)

References 44 publications


“…Recent work puts forward new optimization and evaluation strategies that go beyond the capabilities of existing database management systems. Recent experiments confirm this observation: Whereas existing query processing techniques are mature at executing one query, they miss opportunities for systematically sharing computation across several queries in a batch [50].…”

Section: Structure-aware Learning (mentioning)

confidence: 85%

“…The tightly-integrated systems F [51], AC/DC [3], and LMFAO [50] are data structure-aware in that they exploit the structure and sparsity of the database to lower the complexity and drastically improve the runtime performance of the learning process. In contrast, we call all the other systems structure-agnostic, since they do not exploit properties of the input database.…”

Section: Structure-aware Learning (mentioning)

confidence: 99%

“…The model aggregates over the feature extraction query define a batch of queries. In practice, for training datasets with tens of features, query batch sizes can be in the order of: hundreds to thousands for ridge linear regression; thousands for computing a decision tree node; and tens for an assignment step in k-means clustering [50]. The result of a query batch is then the input to an optimizer such as a gradient descent method that iterates until the model parameters converge.…”

Section: Structure-aware Learning (mentioning)

confidence: 99%
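To make the batch sizes quoted above concrete, the following is a minimal sketch of how such a batch could be enumerated for ridge linear regression. It assumes continuous features only and counts one `SUM(a*b)` aggregate per unordered pair of variables, where the variables are a constant `1`, the features, and the label. The function name `regression_aggregate_batch` and the column names `x1`, ..., `y` are hypothetical and do not come from the cited papers.

```python
from itertools import combinations_with_replacement


def regression_aggregate_batch(features, label="y"):
    """List the distinct SUM(a*b) aggregates over the feature extraction query.

    Assumption: all features are continuous, so one aggregate per unordered
    pair of variables (constant 1, features, label) is enough.
    """
    variables = ["1"] + list(features) + [label]
    return [
        f"SUM({a}*{b})"
        for a, b in combinations_with_replacement(variables, 2)
    ]


# Hypothetical training dataset with 30 continuous features x1..x30.
batch = regression_aggregate_batch([f"x{i}" for i in range(1, 31)])
print(len(batch))  # 32 variables -> 32 * 33 / 2 = 528 aggregates
```

Under these assumptions, 30 features already give a batch of several hundred aggregates, which is in line with the range the statement reports. Categorical features would call for group-by aggregates instead and are not covered by this sketch.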

“…Besides exploiting the structure of the input data and the learning task, the problem of learning models over databases can also benefit tremendously from database system techniques. Recent work [50] showed non-trivial speedups (several orders of magnitude) brought by code optimization for machine learning workloads over state-of-the-art systems such as TensorFlow [1], R [46], Scikit-learn [44], and mlpack [13]. Prime examples of code optimizations leading to such performance improvements include:…”

Section: Database Systems Considerations (mentioning)

confidence: 99%


Learning Models over Relational Data: A Brief Tutorial

Schleich, Olteanu, Abo-Khamis, et al. 2019

Lecture Notes in Computer Science

Self Cite


This tutorial overviews the state of the art in learning models over relational databases and makes the case for a first-principles approach that exploits recent developments in database research. The input to learning classification and regression models is a training dataset defined by feature extraction queries over relational databases. The mainstream approach to learning over relational data is to materialize the training dataset, export it out of the database, and then learn over it using a statistical package. This approach can be expensive as it requires the materialization of the training dataset. An alternative approach is to cast the machine learning problem as a database problem by transforming the data-intensive component of the learning task into a batch of aggregates over the feature extraction query and by computing this batch directly over the input database. The tutorial highlights a variety of techniques developed by the database theory and systems communities to improve the performance of the learning task. They rely on structural properties of the relational data and of the feature extraction query, including algebraic (semi-ring), combinatorial (hypertree width), statistical (sampling), or geometric (distance) structure. They also rely on factorized computation, code specialization, query compilation, and parallelization.
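The idea of computing an aggregate directly over the input database, rather than over a materialized training dataset, can be illustrated with a small sketch. It assumes two hypothetical relations `R(k, a)` and `S(k, b)` joined on `k`, and compares `SUM(a*b)` computed over the materialized join with the same sum computed from per-key partial sums. The relation layout and function names are illustrative assumptions, not the implementation of any system named here.

```python
from collections import defaultdict


def sum_product_materialized(R, S):
    """SUM(a*b) over the join of R(k, a) and S(k, b), enumerating every joined tuple."""
    return sum(a * b for (k1, a) in R for (k2, b) in S if k1 == k2)


def sum_product_pushed_down(R, S):
    """Same aggregate, computed from per-key partial sums without building the join.

    Relies on distributivity: for each key k,
    sum over joined pairs of a*b == (sum of a in R with key k) * (sum of b in S with key k).
    """
    sum_a = defaultdict(float)
    for k, a in R:
        sum_a[k] += a
    sum_b = defaultdict(float)
    for k, b in S:
        sum_b[k] += b
    # Only keys present in both relations contribute to the join.
    return sum(sum_a[k] * sum_b[k] for k in sum_a if k in sum_b)


# Hypothetical toy relations.
R = [(1, 2.0), (1, 3.0), (2, 4.0)]
S = [(1, 10.0), (2, 1.0), (2, 5.0), (3, 7.0)]

print(sum_product_materialized(R, S))  # 74.0
print(sum_product_pushed_down(R, S))   # 74.0
```

The second function makes one pass over each relation, whereas the first touches every pair of tuples. This is the kind of saving that exploiting the algebraic (semi-ring) structure mentioned in the abstract makes possible.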


“…This work opens exciting avenues of future research. New technical developments include: a compilation approach to capture LMFAO's efficient support for categorical variables with multi-root join trees and group-by aggregates [47]; support for parallelization and many-core architectures; and an investigation of the trade-off between runtime performance and size of generated C++ code for models with high degree and many parameters (e.g., factorization machines). We would also like to improve the usability of IFAQ as follows: build an IFAQ library of optimization algorithms and ML models beyond the simple ones discussed in this paper and including boosting trees, random forests, and neural networks; generate optimized code for model selection over different subsets of the given variables; allow IFAQ to work directly on Jupyter notebooks that specify the construction of the data matrix and the model training; and investigate whether the IFAQ compilation techniques can be incorporated into popular data science tools such as Scikit and TensorFlow.…”

Section: Discussion (mentioning)

confidence: 99%

Multi-layer optimizations for end-to-end data analytics

Shaikhha, Schleich, Ghita, et al. 2020

Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization

Self Cite


We consider the problem of training machine learning models over multi-relational data. The mainstream approach is to first construct the training dataset using a feature extraction query over the input database and then use a statistical software package of choice to train the model. In this paper we introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach. IFAQ treats the feature extraction query and the learning task as one program given in IFAQ's domain-specific language, which captures a subset of Python commonly used in Jupyter notebooks for rapid prototyping of machine learning applications. The program is subject to several layers of IFAQ optimizations, such as algebraic transformations, loop transformations, schema specialization, data layout optimizations, and finally compilation into efficient low-level C++ code specialized for the given workload and data. We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and TensorFlow by several orders of magnitude for linear regression and regression tree models over several relational datasets.

CCS Concepts: • Computing methodologies → Supervised learning by regression; • Information systems → Database management system engines; • Software and its engineering → Domain specific languages.


Formal semantics and high performance in declarative machine learning using Datalog

Wang et al. 2021

The VLDB Journal


With an escalating arms race to adopt machine learning (ML) in diverse application domains, there is an urgent need to support declarative machine learning over distributed data platforms. Toward this goal, a new framework is needed where users can specify ML tasks in a manner where programming is decoupled from the underlying algorithmic and system concerns. In this paper, we argue that declarative abstractions based on Datalog are natural fits for machine learning and propose a purely declarative ML framework with a Datalog query interface. We show that using aggregates in recursive Datalog programs entails a concise expression of ML applications, while providing a strictly declarative formal semantics. This is achieved by introducing simple conditions under which the semantics of recursive programs is guaranteed to be equivalent to that of aggregate-stratified ones. We further provide specialized compilation and planning techniques for semi-naive fixpoint computation in the presence of aggregates and optimization strategies that are effective on diverse recursive programs and distributed data platforms. To test and demonstrate these research advances, we have developed a powerful and user-friendly system on top of Apache Spark. Extensive evaluations on large-scale datasets illustrate that this approach will achieve promising performance gains while improving both programming flexibility and ease of development and deployment for ML applications.
