2014 International Joint Conference on Neural Networks (IJCNN)
DOI: 10.1109/ijcnn.2014.6889806
A novel application of Hoeffding's inequality to decision trees construction for data streams

Abstract: Decision trees are commonly applied tools in the task of data stream classification. The most critical point in a decision tree construction algorithm is the choice of the splitting attribute. In the majority of algorithms existing in the literature, the splitting criterion is based on statistical bounds derived for split measure functions. In this paper we propose an entirely new kind of splitting criterion. We derive statistical bounds for the arguments of the split measure function instead of deriving them for the split measure f…

Cited by 13 publications (10 citation statements) · References 29 publications
“…First, we compared the empirical behaviour of all our algorithms on the two-dimensional dataset banana, shown in Figure 1. The simplicity of this dataset allows us to show visually the difference between the four algorithms.…”
Section: B Comparison Among Our Methods
confidence: 99%
“…Incremental decision and rule tree learning systems, such as Very Fast Decision Tree (VFDT) [7] and Decision Rules (RULES) [12], use an incremental version of the split function computation; see also [22], [19], [8], [4].…”
Section: Related Work
confidence: 99%
“…We ran experiments on synthetic datasets and popular benchmarks, comparing our C-Tree (Algorithm 1) against two baselines: H-Tree (the VFDT algorithm [7]) and CorrH-Tree (the method from [8] using the classification error as splitting criterion). The bounds of [28] are not considered because of their conservativeness.…”
Section: Full Sampling Experiments
confidence: 99%
“…Alternative approaches, such as NIP-H and NIP-N, use Gaussian approximations instead of Hoeffding bounds in order to compute confidence intervals. Several extensions of VFDT have been proposed, also taking into account non-stationary data sources; see, e.g., [10], [9], [2], [35], [27], [15], [19], [21], [11], [34], [20], [29], [8]. All these methods are based on the classical Hoeffding bound [14]: after m independent observations of a random variable taking values in a real interval of size R, with probability at least 1 − δ the true mean does not differ from the sample mean by more than…”
Section: Introduction
confidence: 99%
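The classical Hoeffding bound mentioned in the excerpt above bounds the deviation between the true mean and the sample mean by ε = √(R² ln(1/δ) / (2m)). A minimal sketch of how a VFDT-style learner computes this half-width (the function name and example values are illustrative, not from the paper):

```python
import math

def hoeffding_epsilon(R: float, m: int, delta: float) -> float:
    """Half-width of the Hoeffding confidence interval.

    After m independent observations of a random variable ranging
    over an interval of size R, the true mean lies within +/- epsilon
    of the sample mean with probability at least 1 - delta.
    """
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * m))

# Example: a split measure in [0, 1] (R = 1), 1000 observed
# examples, 95% confidence (delta = 0.05).
eps = hoeffding_epsilon(R=1.0, m=1000, delta=0.05)
```

In a Hoeffding-tree learner, a leaf is split once the gap between the best and second-best splitting attribute exceeds this ε; as m grows, ε shrinks at rate O(1/√m), so the decision stabilizes with more data.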
“…This problem has become particularly important in recent years as the amount of collected data has increased. Therefore, researchers have paid special attention to a field of artificial intelligence called data stream mining (DSM) [1][2][3][4][5][6][7][8][9][10][11][12][13]. In the data stream scenario, instead of a static training set, we assume that the data come to the system continuously, one example after another.…”
Section: Introduction
confidence: 99%