Interpretable decision-tree induction in a big data parallel framework

Weinberg, Abraham Itzhak

doi:10.1515/amcs-2017-0051

Cited by 11 publications

(8 citation statements)

References 32 publications

“…Edit-distance is a poor general comparator for diagnostic trees for several reasons. One is that existing algorithms do not take into account that nodes in the tree are not equally important (Jiang et al 1995; Weinberg and Last 2017). Trees B and C are each inconsistent with Tree A in exactly one way. Tree B swaps nodes 2 and 5, while Tree C swaps nodes 2 and 3.…”

Section: Edit-distance Based Techniques (mentioning)

confidence: 99%

Statistical measurement of trees’ similarity

2020

Diagnostic theories are fundamental to Information Systems practice and are represented in trees. One way of creating diagnostic trees is by employing independent experts to construct such trees and compare them. However, good measures of similarity to compare diagnostic trees have not been identified. This paper presents an analysis of the suitability of various measures of association to determine the similarity of two diagnostic trees using bootstrap simulations. We find that three measures of association, Goodman and Kruskal's Lambda, Cohen's Kappa, and Goodman and Kruskal's Gamma (J Am Stat Assoc 49(268):732-764, 1954) each behave differently depending on what is inconsistent between the two trees thus providing both measures for assessing alignment between two trees developed by independent experts as well as identifying the causes of the differences.

“…MapReduce (Dean and Ghemawat, 2008) is one of the popular programming models focusing on automatic data-flow parallelism. It is a popular choice to perform big data analysis with data mining algorithms in a parallel distributed computing environment (Weinberg and Last, 2017). The MapReduce programming model has proven a significant decrease in the execution time of computing-intensive workflows or processes when executing in a distributed parallel environment, e.g., Hadoop (González-Vélez and Kontagora, 2011).…”

Section: Related Work (mentioning)

confidence: 99%

Parallelizing user–defined functions in the ETL workflow using orchestration style sheets

Ali, Mey, Thiele

2019

International Journal of Applied Mathematics and Computer Science

Today’s ETL tools provide capabilities to develop custom code as user-defined functions (UDFs) to extend the expressiveness of the standard ETL operators. However, while this allows us to easily add new functionalities, it also comes with the risk that the custom code is not intended to be optimized, e.g., by parallelism, and for this reason, it performs poorly for data-intensive ETL workflows. In this paper we present a novel framework, which allows the ETL developer to choose a design pattern in order to write parallelizable code and generates a configuration for the UDFs to be executed in a distributed environment. This enables ETL developers with minimum expertise in distributed and parallel computing to develop UDFs without taking care of parallelization configurations and complexities. We perform experiments on large-scale datasets based on TPC-DS and BigBench. The results show that our approach significantly reduces the effort of ETL developers and at the same time generates efficient parallel configurations to support complex and data-intensive ETL tasks.

“…Although the number of the algorithms for data stream mining is not as large as in the case of traditional data mining, in the recent decade there has been a considerable progress in this field. The most successful seem algorithms based on decision trees (Domingos and Hulten, 2000; Jaworski et al, 2017; Rutkowski et al, 2015; Weinberg and Last, 2017) and ensemble methods (Pietruczuk et al, 2017; Wang et al, 2003). They are mainly devoted to data classification problems.…”

Section: Introduction (mentioning)

confidence: 99%

Regression Function and Noise Variance Tracking Methods for Data Streams with Concept Drift

Jaworski

2018

International Journal of Applied Mathematics and Computer Science

Two types of heuristic estimators based on Parzen kernels are presented. They are able to estimate the regression function in an incremental manner. The estimators apply two techniques commonly used in concept-drifting data streams, i.e., the forgetting factor and the sliding window. The methods are applicable for models in which both the function and the noise variance change over time. Although nonparametric methods based on Parzen kernels were previously successfully applied in the literature to online regression function estimation, the problem of estimating the variance of noise was generally neglected. It is sometimes of profound interest to know the variance of the signal considered, e.g., in economics, but it can also be used for determining confidence intervals in the estimation of the regression function, as well as while evaluating the goodness of fit and in controlling the amount of smoothing. The present paper addresses this issue. Specifically, variance estimators are proposed which are able to deal with concept drifting data by applying a sliding window and a forgetting factor, respectively. A number of conducted numerical experiments proved that the proposed methods perform satisfactorily well in estimating both the regression function and the variance of the noise.
