Background: Random forests are becoming increasingly popular in many scientific fields because they can cope with "small n large p" problems, complex interactions and even highly correlated predictor variables. Their variable importance measures have recently been suggested as screening tools for, e.g., gene expression studies. However, these variable importance measures show a bias towards correlated predictor variables.
SUMMARYThis paper discusses techniques to generate survival times for simulation studies regarding Cox proportional hazards models. In linear regression models, the response variable is directly connected with the considered covariates, the regression coefficients and the simulated random errors. Thus, the response variable can be generated from the regression function, once the regression coefficients and the error distribution are specified. However, in the Cox model, which is formulated via the hazard function, the effect of the covariates have to be translated from the hazards to the survival times, because the usual software packages for estimation of Cox models require the individual survival time data. A general formula describing the relation between the hazard and the corresponding survival time of the Cox model is derived. It is shown how the exponential, the Weibull and the Gompertz distribution can be used to generate appropriate survival times for simulation studies. Additionally, the general relation between hazard and survival time can be used to develop own distributions for special situations and to handle flexibly parameterized proportional hazards models. The use of other distributions than the exponential distribution only is indispensable to investigate the characteristics of the Cox proportional hazards model, especially in non-standard situations, where the partial likelihood depends on the baseline hazard.
Classification trees are a popular tool in applied statistics because their heuristic search approach based on impurity reduction is easy to understand and the interpretation of the output is straightforward. However, all standard algorithms suffer from a major problem: variable selection based on standard impurity measures as the Gini Index is biased. The bias is such that, e.g., splitting variables with a high amount of missing values-even if missing completely at random (MCAR)-are artificially preferred. A new split selection criterion that avoids variable selection bias is introduced. The exact distribution of the maximally selected Gini gain is derived by means of a combinatorial approach and the resulting p-value is suggested as an unbiased split selection criterion in recursive partitioning algorithms. The efficiency of the method is demonstrated in simulation studies and a real data study from veterinary gynecology in the context of binary classification and continuous predictor variables with different numbers of missing values. The proposed method is extendible to categorical and ordinal predictor variables and to other split selection criteria such as the cross-entropy.
For the last eight years, microarray-based class prediction has been the subject of numerous publications in medicine, bioinformatics and statistics journals. However, in many articles, the assessment of classifi cation accuracy is carried out using suboptimal procedures and is not paid much attention. In this paper, we carefully review various statistical aspects of classifi er evaluation and validation from a practical point of view. The main topics addressed are accuracy measures, error rate estimation procedures, variable selection, choice of classifi ers and validation strategy.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.