2007
DOI: 10.1016/j.csda.2006.12.030

Unbiased split selection for classification trees based on the Gini Index

Abstract: Classification trees are a popular tool in applied statistics because their heuristic search approach based on impurity reduction is easy to understand and the interpretation of the output is straightforward. However, all standard algorithms suffer from a major problem: variable selection based on standard impurity measures such as the Gini Index is biased. The bias is such that, e.g., splitting variables with a high amount of missing values (even if missing completely at random, MCAR) are artificially preferred. A …
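The bias described in the abstract stems from exhaustive search: a predictor that offers more candidate splits, or whose Gini gain estimate is more variable (as with many missing values), is more likely to attain a large maximal gain by chance alone. The following minimal sketch uses illustrative simulated data (not code or data from the paper) to show the many-categories facet of this effect: of two equally uninformative predictors, the one with more categories achieves a larger best Gini gain on average.

```python
# Illustrative simulation: exhaustive Gini-gain search prefers uninformative
# predictors that offer many candidate splits over equally uninformative
# binary predictors.
import numpy as np

rng = np.random.default_rng(0)

def gini(y):
    """Gini impurity of a 0/1 label vector."""
    if len(y) == 0:
        return 0.0
    p = y.mean()
    return 2.0 * p * (1.0 - p)

def best_gini_gain(x, y):
    """Largest impurity reduction over all binary splits of the form x <= t."""
    parent = gini(y)
    best = 0.0
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        w = len(left) / len(y)
        best = max(best, parent - (w * gini(left) + (1.0 - w) * gini(right)))
    return best

n, reps = 100, 500
gains_binary, gains_many = [], []
for _ in range(reps):
    y = rng.integers(0, 2, n)          # class labels, independent of both predictors
    x_bin = rng.integers(0, 2, n)      # uninformative binary predictor (1 candidate split)
    x_many = rng.integers(0, 20, n)    # uninformative 20-category predictor (19 candidate splits)
    gains_binary.append(best_gini_gain(x_bin, y))
    gains_many.append(best_gini_gain(x_many, y))

print(f"mean best gain, binary predictor:      {np.mean(gains_binary):.4f}")
print(f"mean best gain, 20-category predictor: {np.mean(gains_many):.4f}")
# The second mean is systematically larger, so naive Gini-gain comparison
# prefers the predictor with more categories even though both are pure noise.
```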

Cited by 257 publications (179 citation statements)
References 18 publications
“…It follows that the larger the Gini gain, the larger the impurity reduction. Recently, [9] showed that the use of the Gini gain can lead to selection bias, because categorical predictor variables with many categories are preferred over those with few. In the proposed framework this is not an obstacle, because the features are relations between sampled rectangles and therefore always evaluate to binary predictor variables.…”
Section: Tree Induction
Mentioning confidence: 99%
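As a quick check of the claim that all-binary features sidestep this bias, the sketch below (illustrative code, not the cited framework) scores several uninformative binary indicator features: each offers exactly one candidate split, so their Gini gains under the null share the same distribution and no feature is structurally favored.

```python
# Illustrative simulation: with binary features there is a single candidate
# split per feature, so the multiple-comparison advantage disappears.
import numpy as np

rng = np.random.default_rng(1)

def gini(y):
    """Gini impurity of a 0/1 label vector."""
    if len(y) == 0:
        return 0.0
    p = y.mean()
    return 2.0 * p * (1.0 - p)

def gini_gain_binary(x, y):
    """Impurity reduction of the single split defined by a 0/1 feature."""
    left, right = y[x == 0], y[x == 1]
    w = len(left) / len(y)
    return gini(y) - (w * gini(left) + (1.0 - w) * gini(right))

n, reps, n_features = 200, 2000, 5
gains = np.zeros((reps, n_features))
for r in range(reps):
    y = rng.integers(0, 2, n)              # labels, independent of every feature
    for j in range(n_features):
        x = rng.integers(0, 2, n)          # uninformative binary feature
        gains[r, j] = gini_gain_binary(x, y)

print("mean null Gini gain per binary feature:", gains.mean(axis=0).round(4))
# Up to Monte Carlo noise all features share the same mean, so ranking binary
# features by Gini gain introduces no structural preference among them.
```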
“…The RF provides a measure VI_i of variable importance, obtained by averaging the permutation importance measure over all trees, which has been shown to be a reliable indicator [60]. The permutation importance measure is based on out-of-bag (OOB) errors and is used to select features.…”
Section: Feature Selection
Mentioning confidence: 99%
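The measure referred to here is the random forest's permutation importance, averaged over trees and computed from out-of-bag errors. scikit-learn does not expose the per-tree OOB variant directly, so the sketch below approximates it with permutation importance on a held-out split; the dataset, hyperparameters, and selection threshold are illustrative assumptions, not the cited paper's setup.

```python
# Hedged sketch: rank features by permutation importance and keep those whose
# importance is clearly positive; OOB accuracy is reported as a sanity check.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
rf.fit(X_tr, y_tr)

# Permute each feature on held-out data and measure the drop in accuracy.
result = permutation_importance(rf, X_te, y_te, n_repeats=20, random_state=0)

# Selection rule (an assumption): keep features whose mean importance exceeds
# zero by more than two standard deviations of the permutation repeats.
selected = [j for j in range(X.shape[1])
            if result.importances_mean[j] - 2 * result.importances_std[j] > 0]

print("OOB accuracy estimate:", round(rf.oob_score_, 3))
print("selected feature indices:", selected)
```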
“…However, this contribution is far less significant than that of words that do appear, particularly when the class distribution and the feature frequencies are highly unbalanced. Therefore, they eliminated the factor accounting for words that do not appear and adopted a measure of purity instead of impurity to emphasize the P(W) factor, namely Gini-A, as in expression (3).…”
Section: Gini-Index Theory for Feature Selection
Mentioning confidence: 99%
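Expression (3) is not reproduced in the excerpt, so the sketch below uses an assumed purity-style form, GiniA(W) = P(W) * sum_i P(C_i | W)^2, chosen to match the description: only documents containing the word contribute, and the P(W) factor is emphasized rather than cancelled. The data, function name, and exact formula are illustrative assumptions, not the cited paper's definitions.

```python
# Hedged sketch of a purity-style Gini score for text feature selection.
from collections import Counter

def gini_a(docs, labels, word):
    """Assumed purity score P(W) * sum_i P(C_i|W)^2 of `word` over labelled docs."""
    with_word = [lab for doc, lab in zip(docs, labels) if word in doc]
    if not with_word:
        return 0.0
    p_w = len(with_word) / len(docs)                                  # P(W)
    class_counts = Counter(with_word)
    purity = sum((c / len(with_word)) ** 2                            # sum_i P(C_i|W)^2
                 for c in class_counts.values())
    return p_w * purity

# Toy corpus: each document is a set of words with a class label.
docs = [{"goal", "match"}, {"election", "vote"}, {"match", "team"}, {"vote", "poll"}]
labels = ["sport", "politics", "sport", "politics"]
for w in ["match", "vote", "team"]:
    print(w, round(gini_a(docs, labels, w), 3))
# Words concentrated in a single class and appearing in many documents score highest.
```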
“…Several researchers have indicated that feature selection is biased towards attributes with a large number of possible values, many categories, or many missing values, and many studies on unbiased split selection have been introduced [6]. Recently, Carolin Strobl et al. (2007) introduced unbiased split selection for classification trees based on the Gini Index, a new split selection criterion that avoids the variable selection bias of standard impurity measures, and Marco Sandri (2008) presented a simple and effective method for bias correction, focused on the easily generalizable case of the Gini Index [3], [7]. However, those works mostly concern split selection, not feature selection in text classification.…”
Section: Introduction
Mentioning confidence: 99%