Materialization Optimizations for Feature Selection Workloads

Ce Zhang, Arun Kumar, Christopher Ré

doi:10.1145/2877204

Cited by 101 publications

(87 citation statements)

References 28 publications


“…Application in a System: Recent systems such as Columbus [20,33] and MLBase [21] provide a high-level language that includes both relational and ML operations. Such systems optimize the execution of logical ML computations by choosing among alternative physical plans using cost models.…”

Section: Results (mentioning)

confidence: 99%

“…There is increasing research and industrial interest in building systems that achieve closer integration of ML with data processing. These include systems that combine linear algebra-based languages with data management platforms [4,15,34], systems for Bayesian inference [9], systems for graph-based ML [23], and systems that combine dataflow-based languages for ML with data management platforms [21,22,33]. None of these systems address the problem of learning over joins, but we think our work is easily applicable to the last group of systems.…”

Section: Related Work (mentioning)

confidence: 99%

“…and academia aim to integrate ML capabilities with data processing in RDBMSs, Hadoop, and other systems [2,4,9,15,18,21,22,33,34]. Almost all such implementations of ML algorithms require that the input dataset be a single table.…”

Section: Introduction (mentioning)

confidence: 99%

“…Second, as the base tables evolve, maintaining the materialized output of the join could become an overhead. Finally, analysts often perform exploratory analysis of different subsets of features and data [20,33]. Materializing temporary tables after joins for learning on each subset could slow the analyst and inhibit exploration [7].…”

Section: Introduction (mentioning)

confidence: 99%


Learning Generalized Linear Models Over Normalized Data

Kumar

Naughton

Patel

2015

Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

Self Cite



Enterprise data analytics is a booming area in the data management industry. Many companies are racing to develop toolkits that closely integrate statistical and machine learning techniques with data management systems. Almost all such toolkits assume that the input to a learning algorithm is a single table. However, most relational datasets are not stored as single tables due to normalization. Thus, analysts often perform key-foreign key joins before learning on the join output. This strategy of learning after joins introduces redundancy avoided by normalization, which could lead to poorer end-to-end performance and maintenance overheads due to data duplication. In this work, we take a step towards enabling and optimizing learning over joins for a common class of machine learning techniques called generalized linear models that are solved using gradient descent algorithms in an RDBMS setting. We present alternative approaches to learn over a join that are easy to implement over existing RDBMSs. We introduce a new approach named factorized learning that pushes ML computations through joins and avoids redundancy in both I/O and computations. We study the tradeoff space for all our approaches both analytically and empirically. Our results show that factorized learning is often substantially faster than the alternatives, but is not always the fastest, necessitating a cost-based approach. We also discuss extensions of all our approaches to multi-table joins as well as to Hive.
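The idea of pushing ML computations through a join can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a least-squares loss, a single key-foreign key join between a fact table `S` and a dimension table `R`, and in-memory Python lists and dicts instead of an RDBMS. All names (`gradient_materialized`, `gradient_factorized`, `partial`, `err_by_key`) are hypothetical. The factorized version computes the inner product over `R`'s features once per `R` tuple and groups the per-example errors by foreign key, so `R`'s feature values are never duplicated per `S` tuple.

```python
from collections import defaultdict


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def gradient_materialized(S, R, w_s, w_r):
    """Least-squares gradient computed over the joined (denormalized) rows."""
    g_s = [0.0] * len(w_s)
    g_r = [0.0] * len(w_r)
    for y, xs, fk in S:
        xr = R[fk]  # join lookup: R's features are repeated for every S row
        err = dot(w_s, xs) + dot(w_r, xr) - y
        for j, v in enumerate(xs):
            g_s[j] += err * v
        for j, v in enumerate(xr):
            g_r[j] += err * v
    return g_s, g_r


def gradient_factorized(S, R, w_s, w_r):
    """Same gradient, but R's part of the work is done once per R tuple."""
    # Step 1: partial inner products over R, one per R tuple.
    partial = {rid: dot(w_r, xr) for rid, xr in R.items()}
    # Step 2: scan S, reuse the partials, and sum errors per foreign key.
    g_s = [0.0] * len(w_s)
    err_by_key = defaultdict(float)
    for y, xs, fk in S:
        err = dot(w_s, xs) + partial[fk] - y
        for j, v in enumerate(xs):
            g_s[j] += err * v
        err_by_key[fk] += err
    # Step 3: scale each R tuple's features by its summed error.
    g_r = [0.0] * len(w_r)
    for rid, total in err_by_key.items():
        for j, v in enumerate(R[rid]):
            g_r[j] += total * v
    return g_s, g_r


# Toy data (made up for illustration): rows of S are (label, features, foreign key).
S = [(1.0, [1.0, 2.0], "a"), (0.0, [0.5, -1.0], "a"), (2.0, [3.0, 0.0], "b")]
R = {"a": [1.0, 0.0], "b": [0.0, 2.0]}
print(gradient_materialized(S, R, [0.1, 0.2], [0.3, -0.4]))
print(gradient_factorized(S, R, [0.1, 0.2], [0.3, -0.4]))
```

Both functions should return the same gradient up to floating-point rounding. Which one is cheaper depends on table sizes and feature counts, which is consistent with the abstract's point that a cost-based choice is needed.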


“…There are different types of DML such as: Tasks : ( for further clarification please refer to MLbase [18,21], (fixed task) Columbus [25], DeepDive [20])…”

Section: A Distributed Machine Learning and Data Mining Techniques (mentioning)

confidence: 99%

Time-Saving Approach for Optimal Mining of Association Rules

Mohammed

Balouki

Gadi

2016

IJACSA


Data mining is the process of analyzing data to extract useful information for users. Association rule mining is a data mining technique used to detect correlations and reveal relationships among individual data items in huge databases. These rules usually take the following form: if X then Y, as independent attributes. Association rules have become a popular technique in several vital fields of activity such as insurance, medicine, banks, supermarkets… Association rules are generated in huge numbers by algorithms known as Association Rules Mining algorithms. Generating huge quantities of association rules can be time- and effort-consuming, which creates an urgent need for an efficient and scalable approach that mines only the relevant and significant association rules. This paper proposes an innovative approach which mines the optimal rules from a large set of association rules in a distributed way to improve efficiency and decrease running time.


Impact of Modeling Production Knowledge for a Data Based Prediction of Transition Times

Schuh

Prote

Hünnekes

et al. 2019

IFIP Advances in Information and Communication Technology


An increasing demand for customer-specific products is a major challenge for manufacturing companies. In many cases, companies attempt to satisfy this demand by increasing the number of product variants. In those companies, cost-oriented production processes have to be transformed into flexible workshop or island production structures in order to be able to produce this variety. This leads to an increasing complexity of production and subsequently planning. In order to reliably meet due dates, it is necessary to improve the quality of planning. This paper presents an approach for predicting transition times, the times between two production steps, by employing machine learning methods. In particular, the influence of the modelling of production knowledge of experienced employees on the prediction quality compared to a pure optimization of the methods' parameters is investigated.

