Model selection in high-dimensional settings has received substantial attention in recent years; however, similar advances in the low-dimensional setting have lagged behind. In this article, we introduce a new variable selection procedure for low- to moderate-dimensional regressions (n > p). The method repeatedly splits the data into two sets, one for estimation and one for validation, to obtain an empirically optimized threshold, which is then used to screen variables for inclusion in the final model. In an extensive simulation study, we show that the proposed variable selection technique outperforms the candidate methods (backward elimination via repeated data splitting, univariate screening at the 0.05 level, adaptive LASSO, SCAD): it is among the methods with the lowest inclusion of noisy predictors, has the highest power to detect the correct model, and is unaffected by correlations among the predictors. We illustrate the methods by applying them to a cohort of patients undergoing hepatectomy at our institution.
KEYWORDS: data splitting, empirical threshold, linear regression, variable screening, variable selection
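The split-estimate-validate loop described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the screening statistic (absolute OLS t-statistics), the 50/50 split, the mean-squared-error criterion, and the names `t_statistics` and `empirical_threshold` are all assumptions made here for concreteness.

```python
import numpy as np

def t_statistics(X, y):
    """Absolute OLS t-statistics for each predictor (intercept added internally)."""
    n, p = X.shape
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    sigma2 = resid @ resid / (n - p - 1)           # residual variance, df = n - p - 1
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xd.T @ Xd)))
    return np.abs(beta[1:] / se[1:])               # drop the intercept entry

def empirical_threshold(X, y, thresholds, n_splits=100, seed=0):
    """Pick the screening threshold minimizing average validation MSE over
    repeated 50/50 estimation/validation splits, then screen on the full data."""
    rng = np.random.default_rng(seed)
    n = len(y)
    errors = np.zeros(len(thresholds))
    for _ in range(n_splits):
        idx = rng.permutation(n)
        est, val = idx[: n // 2], idx[n // 2 :]
        stats = t_statistics(X[est], y[est])       # screen on the estimation half
        for j, thr in enumerate(thresholds):
            keep = stats > thr
            Xe = np.column_stack([np.ones(len(est)), X[est][:, keep]])
            Xv = np.column_stack([np.ones(len(val)), X[val][:, keep]])
            beta, *_ = np.linalg.lstsq(Xe, y[est], rcond=None)
            errors[j] += np.mean((y[val] - Xv @ beta) ** 2)
    best = thresholds[int(np.argmin(errors))]
    selected = np.flatnonzero(t_statistics(X, y) > best)  # final screen, full data
    return best, selected

# Synthetic example: 3 true predictors out of 20, n = 200 (the n > p setting)
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 20))
y = X[:, :3] @ np.array([1.0, -0.5, 0.5]) + rng.standard_normal(200)
best, selected = empirical_threshold(X, y, thresholds=np.linspace(0.5, 4.0, 15))
print(best, selected)
```

Note that the outer loop over splits dominates the cost; because the threshold grid reuses the same screening statistics within each split, adding candidate thresholds is cheap relative to adding splits.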
INTRODUCTION

Regression models have become the primary engine for many data analyses, attempting to explain the variability in the response variable by identifying important predictors. Owing to their popularity, variable selection for multivariable regression remains one of the most important problems in applied statistics. An ideal variable selection method would include all "true" predictors in the model while excluding the "noisy" predictors, that is, those with little or no association with the response variable. These two objectives correspond to controlling the type II and type I errors in hypothesis testing. As such, there is a tradeoff between maximizing the selection of true predictors (the full model) and minimizing the selection of noisy predictors (the intercept-only model), and a good variable selection procedure must strike a balance between the two, with the goal of obtaining an accurate yet parsimonious model.

A related way of thinking about this problem is the bias-variance tradeoff that comes with model complexity: more complex models have higher variance and lower bias, while simpler models have higher bias and lower variance.1 Since the expected prediction error on new data can be decomposed into a bias component, a variance component, and noise (which is beyond our control), choosing the model complexity that trades bias off against variance in effect minimizes the test error (a standard form of this decomposition is sketched at the end of this section). This is a desirable goal for a model selection procedure, since we want the model to predict future observations as well as it predicts the observed data.

Many variable selection approaches have been proposed over the years, including significance filtering, stepwise methods, and penalized regression (reviews can be found in References 1-3, among others). After the
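The bias-variance decomposition referenced above, in its standard squared-error form (the notation here is ours, not the article's): for a new observation $y_0 = f(x_0) + \varepsilon$ with $\mathrm{Var}(\varepsilon) = \sigma^2$ and a fitted model $\hat{f}$,

```latex
\mathbb{E}\bigl[(y_0 - \hat{f}(x_0))^2\bigr]
  = \underbrace{\bigl(\mathbb{E}[\hat{f}(x_0)] - f(x_0)\bigr)^2}_{\text{squared bias}}
  + \underbrace{\operatorname{Var}\bigl(\hat{f}(x_0)\bigr)}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

Only the first two terms depend on model complexity, which is why minimizing their sum over candidate models is equivalent to minimizing expected test error.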