2017
DOI: 10.1515/amcs-2017-0051

Interpretable decision-tree induction in a big data parallel framework

Abstract: When running data-mining algorithms on big data platforms, a parallel, distributed framework, such as MAPREDUCE, may be used. However, in a parallel framework, each individual model fits the data allocated to its own computing node without necessarily fitting the entire dataset. In order to induce a single consistent model, ensemble algorithms, such as majority voting, aggregate the local models rather than analyzing the entire dataset directly. Our goal is to develop an efficient algorithm for choosing one re…
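
The majority-voting aggregation mentioned in the abstract can be illustrated with a minimal sketch. It is not the algorithm proposed in the paper; it only shows the ensemble baseline the abstract refers to. The names `LocalModel`, `majority_vote`, and the three toy models are hypothetical, and each toy model merely stands in for a decision tree trained on one data partition.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence

# Hypothetical type: a local model maps one feature record to a class label.
LocalModel = Callable[[dict], Hashable]


def majority_vote(models: Sequence[LocalModel], record: dict) -> Hashable:
    """Return the label predicted by the largest number of local models."""
    votes = Counter(model(record) for model in models)
    # most_common(1) yields [(label, count)]; on a tie, the label that was
    # counted first wins.
    return votes.most_common(1)[0][0]


# Toy stand-ins for trees induced on three separate partitions (assumed rules).
def model_a(r: dict) -> str:
    return "yes" if r["income"] > 50 else "no"


def model_b(r: dict) -> str:
    return "yes" if r["age"] > 30 else "no"


def model_c(r: dict) -> str:
    return "yes" if r["income"] > 80 else "no"


prediction = majority_vote([model_a, model_b, model_c], {"income": 60, "age": 25})
print(prediction)
```

In this sketch every prediction requires consulting all local models, so no single tree describes the decision, which is the interpretability concern the title points at.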

citations
Cited by 11 publications
(8 citation statements)
references
References 32 publications
“…Edit-distance is a poor general comparator for diagnostic trees for several reasons. One is that existing algorithms do not take into account that nodes in the tree are not equally important (Jiang et al 1995; Weinberg and Last 2017). Trees B and C are each inconsistent with Tree A in exactly one way. Tree B swaps nodes 2 and 5, while Tree C swaps nodes 2 and 3.…”
Section: Edit-distance Based Techniques (mentioning)
confidence: 99%
“…MapReduce (Dean and Ghemawat, 2008) is one of the popular programming models focusing on automatic data-flow parallelism. It is a popular choice to perform big data analysis with data mining algorithms in a parallel distributed computing environment (Weinberg and Last, 2017). The MapReduce programming model has proven a significant decrease in the execution time of computing-intensive workflows or processes when executing in a distributed parallel environment, e.g., Hadoop (González-Vélez and Kontagora, 2011).…”
Section: Related Work (mentioning)
confidence: 99%
“…Although the number of the algorithms for data stream mining is not as large as in the case of traditional data mining, in the recent decade there has been a considerable progress in this field. The most successful seem algorithms based on decision trees (Domingos and Hulten, 2000; Jaworski et al, 2017; Rutkowski et al, 2015; Weinberg and Last, 2017) and ensemble methods (Pietruczuk et al, 2017; Wang et al, 2003). They are mainly devoted to data classification problems.…”
Section: Introduction (mentioning)
confidence: 99%