The MADlib analytics library

Hellerstein, Joseph M.; Ré, Christopher; Schoppmann, Florian; Wang, Daisy Zhe; Fratkin, Eugene; Gorajek, Aleksander; Ng, Kee Siong; Welton, Caleb; Feng, Xixuan; Li, Kun; Kumar, Arun

doi:10.14778/2367502.2367510

Cited by 323 publications

(223 citation statements)

References 28 publications

Citation statements by type: Supporting, Mentioning (218), Contrasting, Unclassified


“…The Potter's Wheel tool [122] also supports column analysis, in particular, detecting data types and syntactic structures/patterns. Data profiling functionality is also included in the MADLib toolkit for scalable in-database analytics [71], including column statistics, such as count, count distinct, … Recent data quality tools are dependency-driven: classical dependencies, such as FDs and INDs, as well as their conditional extensions, may be used to express the intended data semantics, and dependency violations may indicate possible data quality problems. Most research systems require users to supply data quality rules and dependencies, such as GDR [138], Nadeef [34], Semandaq [45] and StreamClean [84].…”

Section: Research Tools (mentioning)

confidence: 99%

Profiling relational data: a survey

Abedjan et al.

2015


Profiling data to determine metadata about a given dataset is an important and frequent activity of any IT professional and researcher and is necessary for various use-cases. It encompasses a vast array of methods to examine datasets and produce metadata. Among the simpler results are statistics, such as the number of null values and distinct values in a column, its data type, or the most frequent patterns of its data values. Metadata that are more difficult to compute involve multiple columns, namely correlations, unique column combinations, functional dependencies, and inclusion dependencies. Further techniques detect conditional properties of the dataset at hand. This survey provides a classification of data profiling tasks and comprehensively reviews the state of the art for each class. In addition, we review data profiling tools and systems from research and industry. We conclude with an outlook on the future of data profiling beyond traditional profiling tasks and beyond relational databases.


“…These include data mining toolkits from major RDBMS vendors, which integrate specific algorithms with an RDBMS [3,23]. Similar efforts exist for other data platforms [1].…”

Section: Analytics Systems (mentioning)

confidence: 99%

Materialization Optimizations for Feature Selection Workloads

Zhang, Ce; Kumar, Arun; Ré, Christopher

2016

ACM Trans. Database Syst.

Self Cite

101


There is an arms race in the data management industry to support analytics, in which one critical step is feature selection, the process of selecting a feature set that will be used to build a statistical model. Analytics is one of the biggest topics in data management, and feature selection is widely regarded as the most critical step of analytics; thus, we argue that managing the feature selection process is a pressing data management challenge. We study this challenge by describing a feature-selection language and a supporting prototype system that builds on top of current industrial, R-integration layers. From our interactions with analysts, we learned that feature selection is an interactive, human-in-the-loop process, which means that feature selection workloads are rife with reuse opportunities. Thus, we study how to materialize portions of this computation using not only classical database materialization optimizations but also methods that have not previously been used in database optimization, including structural decomposition methods (like QR factorization) and warmstart. These new methods have no analog in traditional SQL systems, but they may be interesting for array and scientific database applications. On a diverse set of data sets and programs, we find that traditional database-style approaches that ignore these new opportunities are more than two orders of magnitude slower than an optimal plan in this new tradeoff space across multiple R-backends. Furthermore, we show that it is possible to build a simple cost-based optimizer to automatically select a near-optimal execution plan for feature selection.


“…• DML Algorithms (fixed algorithm): (for further clarification please refer to OptiML [23], SciDB [13][14][15][16][17][18][19][20][21][22], SystemML [12][13][14][15][16], SimSQL [14])…”

Section: A Distributed Machine Learning and Data Mining Techniques (mentioning)

confidence: 99%

“…• Large-Scale ML Libraries (fixed plan): (for further clarification please refer to MLlib [19], Mahout [24], MADlib [15][16][17], ORE, Rev R)…”

Section: A Distributed Machine Learning and Data Mining Techniques (mentioning)

confidence: 99%

Time-Saving Approach for Optimal Mining of Association Rules

Mohammed; Balouki; Gadi

2016

IJACSA


Data mining is the process of analyzing data to extract useful information for users. Association rule mining is a data mining technique used to detect correlations and reveal relationships among individual data items in huge databases. These rules usually take the form "if X then Y", with X and Y as independent attributes. Association rules have become a popular technique in several vital fields of activity such as insurance, medicine, banking, and supermarkets. They are generated in huge numbers by algorithms known as association rule mining algorithms. Generating huge quantities of association rules can consume considerable time and effort, which creates an urgent need for an efficient and scalable approach that mines only the relevant and significant rules. This paper proposes an innovative approach that mines the optimal rules from a large set of association rules using distributed processing, to improve efficiency and decrease running time.
