Ultrahigh dimensional data are collected in many scientific fields where the predictor dimension is often much higher than the sample size. To reduce the ultrahigh dimensionality effectively, many marginal screening approaches have been developed. However, existing screening methods may miss some important predictors that are marginally independent of the response, or select unimportant ones because of their high correlations with the important predictors. Iterative screening procedures have been proposed to address this issue, but studying their theoretical properties is not straightforward. Penalized regression, in turn, is neither computationally efficient nor numerically stable when the predictors are ultrahigh dimensional. To overcome these drawbacks, Wang (2009) proposed a novel Forward Regression (FR) approach for linear models. However, nonlinear dependence between the predictors and the response is often present in ultrahigh dimensional problems. In this paper, we further extend FR to develop a Forward Additive Regression (FAR) method for selecting significant predictors in ultrahigh dimensional nonparametric additive models. We establish the screening consistency of the FAR method and examine its finite-sample performance through Monte Carlo simulations. The simulations indicate that, compared with marginal screening methods, FAR is much more effective at identifying important predictors for additive models. When the predictors are highly correlated, FAR even outperforms iterative marginal screening methods, such as the iterative nonparametric independence screening (INIS). We also apply the FAR method to a real data analysis in genetic studies.

Key words and phrases: Additive models, forward regression, screening consistency, ultrahigh dimensionality, variable selection.

Statistica Sinica: Newly accepted Paper (accepted author-version subject to English editing). W. Zhong, S. Duan and L. Zhu
Introduction

Advances in modern information technology allow researchers in various scientific fields to collect high dimensional data in which the number of predictors is greater than the sample size. Under the sparsity assumption that only a small subset of predictors truly contribute to the response, penalized regression methods have been intensively studied for various parametric and nonparametric models in the literature. They include, but are not limited to, the LASSO (Tibshirani, 1996) and the Dantzig selector (Candes and Tao, 2007). These methods are able to select significant variables and estimate parameters simultaneously; as a result, both model interpretability and prediction accuracy can be enhanced.

When the predictor dimension is much greater than the sample size, the aforementioned penalized approaches may suffer from high computational complexity, algorithmic instability, or statistical inaccuracy (Fan, Samworth and Wu, 2009). Since the seminal work of Fan and Lv (2008), various marginal screening procedures have been proposed to reduce the ultrahigh dimensionality. The key idea of screening is to rank all predictors using a marginal util...
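As a minimal illustration of the ranking idea behind marginal screening, the sketch below implements a sure-independence-screening-style utility: each predictor is ranked by its absolute marginal sample correlation with the response, and only the top d predictors are retained. The function name `sis_screen`, the simulated model, and the cutoff d are hypothetical choices for this example, not part of the paper's method.

```python
import numpy as np

def sis_screen(X, y, d):
    """Rank predictors by |marginal correlation| with y; keep the top d.

    A simple SIS-style marginal utility: no joint modeling, so a predictor
    correlated with an important one can be selected, and an important
    predictor that is marginally independent of y can be missed.
    """
    Xc = X - X.mean(axis=0)          # center each column
    yc = y - y.mean()
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    return np.argsort(corr)[::-1][:d]  # indices of the d largest utilities

# Toy example: n = 100 observations, p = 1000 predictors, two true signals.
rng = np.random.default_rng(0)
n, p = 100, 1000
X = rng.standard_normal((n, p))
y = 2 * X[:, 0] - 3 * X[:, 1] + rng.standard_normal(n)
kept = sis_screen(X, y, d=20)        # screened-down candidate set
```

With strong marginal signals as above, the true predictors (columns 0 and 1) land in the retained set; the paper's point is that this can fail once predictors are highly correlated or signals are only jointly visible.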