2007
DOI: 10.1016/j.csda.2006.12.030

Unbiased split selection for classification trees based on the Gini Index

Abstract: Classification trees are a popular tool in applied statistics because their heuristic search approach based on impurity reduction is easy to understand and the interpretation of the output is straightforward. However, all standard algorithms suffer from a major problem: variable selection based on standard impurity measures such as the Gini Index is biased. The bias is such that, e.g., splitting variables with a high amount of missing values (even if missing completely at random, MCAR) are artificially preferred. A …
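The bias described in the abstract stems from exhaustive search: a predictor that offers more candidate splits, or whose Gini gain estimate is more variable (as with many missing values), is more likely to attain a large maximal gain by chance alone. The following minimal sketch uses illustrative simulated data (not code or data from the paper) to show the many-categories facet of this effect: of two equally uninformative predictors, the one with more categories achieves a larger best Gini gain on average.

```python
# Illustrative simulation: exhaustive Gini-gain search prefers uninformative
# predictors that offer many candidate splits over equally uninformative
# binary predictors.
import numpy as np

rng = np.random.default_rng(0)

def gini(y):
    """Gini impurity of a 0/1 label vector."""
    if len(y) == 0:
        return 0.0
    p = y.mean()
    return 2.0 * p * (1.0 - p)

def best_gini_gain(x, y):
    """Largest impurity reduction over all binary splits of the form x <= t."""
    parent = gini(y)
    best = 0.0
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        w = len(left) / len(y)
        best = max(best, parent - (w * gini(left) + (1.0 - w) * gini(right)))
    return best

n, reps = 100, 500
gains_binary, gains_many = [], []
for _ in range(reps):
    y = rng.integers(0, 2, n)          # class labels, independent of both predictors
    x_bin = rng.integers(0, 2, n)      # uninformative binary predictor (1 candidate split)
    x_many = rng.integers(0, 20, n)    # uninformative 20-category predictor (19 candidate splits)
    gains_binary.append(best_gini_gain(x_bin, y))
    gains_many.append(best_gini_gain(x_many, y))

print(f"mean best gain, binary predictor:      {np.mean(gains_binary):.4f}")
print(f"mean best gain, 20-category predictor: {np.mean(gains_many):.4f}")
# The second mean is systematically larger, so naive Gini-gain comparison
# prefers the predictor with more categories even though both are pure noise.
```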

Cited by 257 publications (179 citation statements)
References 18 publications
“…It follows that the larger the Gini gain, the larger the impurity reduction. Recently, [9] showed that the use of the Gini gain can lead to selection bias, because categorical predictor variables with many categories are preferred over those with few. In the proposed framework this is not an obstacle, because the features are relations between sampled rectangles and therefore always evaluate to binary predictor variables.…”
Section: Tree Induction
Mentioning confidence: 99%
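As a quick check of the claim that all-binary features sidestep this bias, the sketch below (illustrative code, not the cited framework) scores several uninformative binary indicator features: each offers exactly one candidate split, so their Gini gains under the null share the same distribution and no feature is structurally favored.

```python
# Illustrative simulation: with binary features there is a single candidate
# split per feature, so the multiple-comparison advantage disappears.
import numpy as np

rng = np.random.default_rng(1)

def gini(y):
    """Gini impurity of a 0/1 label vector."""
    if len(y) == 0:
        return 0.0
    p = y.mean()
    return 2.0 * p * (1.0 - p)

def gini_gain_binary(x, y):
    """Impurity reduction of the single split defined by a 0/1 feature."""
    left, right = y[x == 0], y[x == 1]
    w = len(left) / len(y)
    return gini(y) - (w * gini(left) + (1.0 - w) * gini(right))

n, reps, n_features = 200, 2000, 5
gains = np.zeros((reps, n_features))
for r in range(reps):
    y = rng.integers(0, 2, n)              # labels, independent of every feature
    for j in range(n_features):
        x = rng.integers(0, 2, n)          # uninformative binary feature
        gains[r, j] = gini_gain_binary(x, y)

print("mean null Gini gain per binary feature:", gains.mean(axis=0).round(4))
# Up to Monte Carlo noise all features share the same mean, so ranking binary
# features by Gini gain introduces no structural preference among them.
```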
“…The RF provides a measure VI_i of variable importance, obtained by averaging the permutation importance measure over all trees, which has been shown to be a reliable indicator [60]. The permutation importance measure is based on out-of-bag (OOB) errors and is used to select features.…”
Section: Feature Selection
Mentioning confidence: 99%
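The measure referred to here is the random forest's permutation importance, averaged over trees and computed from out-of-bag errors. scikit-learn does not expose the per-tree OOB variant directly, so the sketch below approximates it with permutation importance on a held-out split; the dataset, hyperparameters, and selection threshold are illustrative assumptions, not the cited paper's setup.

```python
# Hedged sketch: rank features by permutation importance and keep those whose
# importance is clearly positive; OOB accuracy is reported as a sanity check.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=300, oob_score=True, random_state=0)
rf.fit(X_tr, y_tr)

# Permute each feature on held-out data and measure the drop in accuracy.
result = permutation_importance(rf, X_te, y_te, n_repeats=20, random_state=0)

# Selection rule (an assumption): keep features whose mean importance exceeds
# zero by more than two standard deviations of the permutation repeats.
selected = [j for j in range(X.shape[1])
            if result.importances_mean[j] - 2 * result.importances_std[j] > 0]

print("OOB accuracy estimate:", round(rf.oob_score_, 3))
print("selected feature indices:", selected)
```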
“…However, this contribution is far less significant than that of words that do appear, particularly when the class distribution and the feature frequencies are highly unbalanced. Therefore, they eliminated the factor accounting for words that do not appear and adopted a measure of purity instead of impurity to emphasize the P(W) factor, namely Gini-A, as in expression (3).…”
Section: Gini-Index Theory for Feature Selection
Mentioning confidence: 99%
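Expression (3) is not reproduced in the excerpt, so the sketch below uses an assumed purity-style form, GiniA(W) = P(W) * sum_i P(C_i | W)^2, chosen to match the description: only documents containing the word contribute, and the P(W) factor is emphasized rather than cancelled. The data, function name, and exact formula are illustrative assumptions, not the cited paper's definitions.

```python
# Hedged sketch of a purity-style Gini score for text feature selection.
from collections import Counter

def gini_a(docs, labels, word):
    """Assumed purity score P(W) * sum_i P(C_i|W)^2 of `word` over labelled docs."""
    with_word = [lab for doc, lab in zip(docs, labels) if word in doc]
    if not with_word:
        return 0.0
    p_w = len(with_word) / len(docs)                                  # P(W)
    class_counts = Counter(with_word)
    purity = sum((c / len(with_word)) ** 2                            # sum_i P(C_i|W)^2
                 for c in class_counts.values())
    return p_w * purity

# Toy corpus: each document is a set of words with a class label.
docs = [{"goal", "match"}, {"election", "vote"}, {"match", "team"}, {"vote", "poll"}]
labels = ["sport", "politics", "sport", "politics"]
for w in ["match", "vote", "team"]:
    print(w, round(gini_a(docs, labels, w), 3))
# Words concentrated in a single class and appearing in many documents score highest.
```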
“…Several researchers have indicated that feature selection is biased towards attributes with a large number of possible values, many categories, or many missing values, and many studies on unbiased split selection have been introduced [6]. Recently, Carolin Strobl et al. (2007) introduced unbiased split selection for classification trees based on the Gini Index, a new split selection criterion that avoids the variable selection bias of standard impurity measures, and Marco Sandri (2008) presented a simple and effective method for bias correction, focused on the easily generalizable case of the Gini Index [3], [7]. However, those works mostly concern split selection, not feature selection in text classification.…”
Section: Introduction
Mentioning confidence: 99%