2000
DOI: 10.1023/a:1007631014630

Multiple Comparisons in Induction Algorithms (David D. Jensen and Paul R. Cohen, Machine Learning)

Abstract: A single mechanism is responsible for three pathologies of induction algorithms: attribute selection errors, overfitting, and oversearching. In each pathology, induction algorithms compare multiple items based on scores from an evaluation function and select the item with the maximum score. We call this a multiple comparison procedure (MCP). We analyze the statistical properties of MCPs and show how failure to adjust for these properties leads to the pathologies. We also discuss approaches that can co…
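The central statistical point of the abstract, that the maximum of many evaluation scores is inflated even when no item is genuinely better than the others, can be checked with a small Monte Carlo sketch (hypothetical numbers, not a reproduction of the paper's analysis):

```python
import numpy as np

rng = np.random.default_rng(0)

def expected_max_null_score(n_candidates, n_trials=10_000):
    """Average maximum of n_candidates i.i.d. standard-normal 'scores',
    i.e. all candidates are equally uninformative."""
    scores = rng.standard_normal((n_trials, n_candidates))
    return scores.max(axis=1).mean()

for k in (1, 5, 25, 125):
    print(f"{k:>3} candidates: E[max score] ~ {expected_max_null_score(k):.2f}")

# The expected maximum grows with the number of comparisons, so an induction
# algorithm that keeps the top-scoring item without adjusting for this
# inflation will overestimate that item's quality.
```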

Citation types: 0 supporting, 17 mentioning, 0 contrasting
Year published (citing statements): 2005–2021

Publication Types

Select...
5
2
1
1

Relationship

0
9

Authors

Journals

Cited by 165 publications (17 citation statements)
References 40 publications
“…The Classification and Regression Tree (CART) algorithm is the most widely used algorithm to construct a Random Forest. Some studies, however, recognized a bias, with respect to variable selection, toward variables with different scales and many possible splits within the CART algorithm [40][41][42][43][44][45]. Hence, the Conditional Inference Tree (CIT) algorithm was developed to overcome this bias and improve the interpretability of the trees [46].…”
Section: Random Forests (mentioning)
confidence: 99%
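The variable-selection bias described in this statement can be reproduced with a short simulation (a hedged sketch, not the setup of the cited studies): when a split is chosen by exhaustively maximizing a CART-style impurity reduction, a covariate that offers many candidate cut points is preferred over a low-cardinality covariate even when neither is related to the outcome.

```python
import numpy as np

rng = np.random.default_rng(1)

def best_split_gain(x, y):
    """Best reduction in squared error over all cut points of x (CART-style)."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    total = ((ys - ys.mean()) ** 2).sum()
    best = 0.0
    for i in range(1, len(xs)):
        if xs[i] == xs[i - 1]:
            continue  # identical values: no cut point between them
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        best = max(best, total - sse)
    return best

wins_continuous, n_sim, n = 0, 2000, 50
for _ in range(n_sim):
    y = rng.standard_normal(n)                    # outcome unrelated to both covariates
    x_many = rng.standard_normal(n)               # continuous: ~49 candidate cut points
    x_few = rng.integers(0, 2, n).astype(float)   # binary: a single cut point
    wins_continuous += best_split_gain(x_many, y) > best_split_gain(x_few, y)

print(f"continuous covariate selected in {wins_continuous / n_sim:.0%} of null datasets")
# Well above 50%, which is the bias that conditional inference trees avoid by
# separating variable selection (permutation tests) from cut-point search.
```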
“…An example of this would be to determine at first the relevant inputs to use with the forecasting methods, such as conducting feature selections using a wrapper approach [26]. These methods may have prohibitive computational cost when working with the full datasets, while increasing the risk of oversearching the space of forecasting methods [27]. Working on smaller but representative subsets for hyperparameter tuning or feature selection allows the computation time to be reduced, while optimizing over only a small part of the full training set, keeping the rest of the training set untouched for the final training.…”
Section: Results (mentioning)
confidence: 99%
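One way to realize the subset strategy described in the quote is sketched below with scikit-learn (an illustrative choice of library, estimator, and data sizes, not the cited work's pipeline): the wrapper-style feature search only sees a stratified subset, and the chosen features are then refit on the untouched full training set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=20_000, n_features=20, n_informative=6,
                           random_state=0)

# The expensive wrapper search runs only on a small, stratified subset, which
# keeps computation cheap and limits how hard the method space is searched.
X_sub, _, y_sub, _ = train_test_split(X, y, train_size=2_000, stratify=y,
                                      random_state=0)
selector = SequentialFeatureSelector(
    RandomForestClassifier(n_estimators=50, random_state=0),
    n_features_to_select=5, direction="forward", cv=3)
selector.fit(X_sub, y_sub)

# The final model is trained on the full training set, restricted to the
# features chosen on the subset.
model = RandomForestClassifier(n_estimators=300, random_state=0)
model.fit(selector.transform(X), y)
```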
“…Besides the Bonferroni correction, different cross-validation methods are implemented in the semtree package. Cross-validation separates the estimation of SEMs from the testing of a potential cut point (e.g., Jensen and Cohen, 2000). SEM trees can be grown with a two-stage approach (Loh and Shih, 1997;Shih, 2004;Brandmaier et al, 2013b) that splits the sample associated with a node in half.…”
Section: Structural Equation Model Trees (mentioning)
confidence: 99%
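The split-half idea mentioned in the quote can be illustrated generically (a simplified Python sketch, not the semtree implementation, using a plain mean difference in place of a structural equation model): one half of the node's sample chooses the cut point, the other half tests it, so the same data never both selects and evaluates a split.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def two_stage_split_test(x, y):
    """Choose a cut point on one random half, test it on the other half."""
    idx = rng.permutation(len(x))
    a, b = idx[: len(x) // 2], idx[len(x) // 2:]

    # Stage 1: on half A, pick the interior cut point maximizing |t|.
    candidates = np.quantile(x[a], np.linspace(0.2, 0.8, 13))
    best_cut = max(candidates,
                   key=lambda c: abs(stats.ttest_ind(y[a][x[a] <= c],
                                                     y[a][x[a] > c]).statistic))

    # Stage 2: on half B, test that single, pre-chosen cut point.
    t, p = stats.ttest_ind(y[b][x[b] <= best_cut], y[b][x[b] > best_cut])
    return best_cut, p

x = rng.uniform(size=200)
y = rng.standard_normal(200)        # null case: y is unrelated to x
print(two_stage_split_test(x, y))   # the reported p-value is not inflated by the search
```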
“…Another problem of the current semtree package is that the standard approach to split evaluation (called naïve selection approach in semtree) is biased by favoring the selection of covariates with many unique values over covariates with few unique values (Brandmaier et al, 2013b). The semtree package offers a correction procedure (fair selection approach) for this selection bias (also known as attribute selection error; Jensen and Cohen, 2000). However, this correction procedure is heuristic and comes at the price of decreased statistical power to detect group differences.…”
Section: Introduction (mentioning)
confidence: 99%
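A Bonferroni-style adjustment of the kind referred to here can be sketched abstractly (an illustrative Python fragment with hypothetical inputs, not necessarily semtree's exact procedure): before covariates are compared, each covariate's best split p-value is multiplied by the number of cut points examined for it, which removes the advantage of high-cardinality covariates at the cost of statistical power.

```python
def select_covariate(best_p_per_covariate, n_cuts_per_covariate):
    """Pick a covariate by Bonferroni-adjusted best-split p-values.

    best_p_per_covariate -- smallest raw split p-value found per covariate
    n_cuts_per_covariate -- number of candidate cut points examined per covariate
    (both arguments are hypothetical inputs for this sketch)
    """
    adjusted = {name: min(1.0, p * n_cuts_per_covariate[name])
                for name, p in best_p_per_covariate.items()}
    winner = min(adjusted, key=adjusted.get)
    return winner, adjusted[winner]

# A binary covariate (1 cut point) versus a continuous one (40 cut points) with
# a slightly smaller raw p-value: after adjustment the binary covariate wins.
print(select_covariate({"binary": 0.010, "continuous": 0.008},
                       {"binary": 1, "continuous": 40}))
```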