Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data 2015
DOI: 10.1145/2723372.2723713
Learning Generalized Linear Models Over Normalized Data

Abstract: Enterprise data analytics is a booming area in the data management industry. Many companies are racing to develop toolkits that closely integrate statistical and machine learning techniques with data management systems. Almost all such toolkits assume that the input to a learning algorithm is a single table. However, most relational datasets are not stored as single tables due to normalization. Thus, analysts often perform key-foreign key joins before learning on the join output. This strategy of learning afte…

Cited by 126 publications (118 citation statements); references 25 publications.
“…In-database machine learning is a growing class of algorithms that aim to learn in time sublinear in the input data, a.k.a. the design matrix [22,2,11,3,18,19]. The trick is that the design matrix J often happens to be the output of some database query Q whose size could be much larger than the size of its input tables T1, .…”
Section: Related Results
confidence: 99%
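The quoted point — that the join output J can be much larger than the base tables it is derived from — is what makes learning over normalized data attractive. A minimal sketch, with entirely hypothetical data, of pushing an aggregate past a key-foreign key join so the work stays linear in the base tables rather than in the join output:

```python
from collections import defaultdict

# Hypothetical tables: fact table S holds (foreign_key, x) pairs,
# dimension table R maps each key to a value y.
S = [(1, 2.0), (1, 3.0), (2, 5.0), (2, 1.0), (2, 4.0)]
R = {1: 10.0, 2: 20.0}

# Naive: materialize the join S ⋈ R, then sum x * y over every joined row.
joined = [(x, R[k]) for (k, x) in S]
naive_sum = sum(x * y for (x, y) in joined)

# Factorized: accumulate partial sums of x per key in one pass over S,
# then combine with R in one pass over R — the join is never materialized.
partial = defaultdict(float)
for k, x in S:
    partial[k] += x
factorized_sum = sum(partial[k] * y for k, y in R.items())

assert naive_sum == factorized_sum  # 5*10 + 10*20 = 250
```

The same push-down idea underlies the sublinear-in-J algorithms cited above: the aggregates a learning algorithm needs decompose over the query that defines the design matrix.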
“…Example: Let ε = 1 and A be an array having the weights (10, 12, 13, 14, 15, 16, 17, 18, 19); then the ε-sketch of A, denoted A′, will be an array that has the weights of indices (1, 2, 4, 8) in A (the weights of index 16 or higher are assumed to be ∞); thus, A′ = {10, 12, 14, 14, 18, 18, 18, 18}.…”
Section: Approximate Inequality Row Counting
confidence: 99%
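The ε-sketch in the quoted example keeps only the weights at indices that are powers of (1 + ε), rounding every other index up to the next such power. A small sketch of that construction, assuming 1-based indexing and a non-decreasing weight array as in the quote:

```python
import math

def eps_sketch(A, eps=1.0):
    """Build the ε-sketch of a non-decreasing 1-based weight array A.

    Each index i is rounded up to the nearest power of (1 + eps); indices
    that round past the end of A are treated as weight ∞ and omitted,
    matching the quoted example.
    """
    n = len(A)
    sketch = []
    for i in range(1, n + 1):
        j = 1
        while j < i:                      # round i up to a power of (1 + eps)
            j = math.ceil(j * (1 + eps))
        if j <= n:
            sketch.append(A[j - 1])
    return sketch

A = [10, 12, 13, 14, 15, 16, 17, 18, 19]
print(eps_sketch(A, eps=1.0))  # [10, 12, 14, 14, 18, 18, 18, 18]
```

With ε = 1 the retained indices are the powers of two (1, 2, 4, 8), reproducing A′ from the example; index 9 rounds up to 16 and is dropped.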
“…Further examples in this category are: Orion [30] and Hamlet [31], which support generalized linear models and Naïve Bayes classification; recent efforts on scaling linear algebra using existing distributed database systems [32]; the declarative language BUDS [20], whose compiler can perform deep optimizations of the user's program; and Morpheus [14]. Morpheus factorizes the computation of the linear algebra operators summation, matrix multiplication, pseudo-inverse, and element-wise operations over training datasets defined by key-foreign key star or chain joins.…”
Section: Related Work
confidence: 99%
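Factorizing a linear algebra operator over a key-foreign key join, as the quote attributes to Morpheus, means multiplying each base table once instead of multiplying the materialized join. A hypothetical sketch with random data (names and shapes are illustrative, not Morpheus's API): the training matrix T is the join of a fact table with features X_S and a dimension table with features X_R, and T @ w is computed without building T.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_r, d_s, d_r = 6, 3, 2, 4
X_S = rng.standard_normal((n_s, d_s))   # fact-table features
X_R = rng.standard_normal((n_r, d_r))   # dimension-table features
fk = rng.integers(0, n_r, size=n_s)     # foreign key of each fact row

w = rng.standard_normal(d_s + d_r)
w_S, w_R = w[:d_s], w[d_s:]

# Materialized: build T = [X_S | X_R[fk]] row by row, then multiply.
T = np.hstack([X_S, X_R[fk]])
dense = T @ w

# Factorized: multiply each base table once, then scatter the dimension
# part along the foreign key — no row of R is multiplied more than once.
factorized = X_S @ w_S + (X_R @ w_R)[fk]

assert np.allclose(dense, factorized)
```

The saving grows with join redundancy: when many fact rows share one dimension row, the factorized form does O(n_s·d_s + n_r·d_r) work versus O(n_s·(d_s + d_r)) for the materialized multiply.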
“…The database community has identified various opportunities for optimizing DPR. Several approaches identify the join as a key bottleneck in DPR and optimize it [37,15,49,38]. Kumar et al [37] optimize generalized linear models directly over factorized / normalized representations of relational data, avoiding key-foreign key joins.…”
Section: Related Work
confidence: 99%