Many modern biomedical studies have yielded survival data with high-throughput predictors. The goals of scientific research often lie in identifying predictive biomarkers, understanding biological mechanisms, and making accurate and precise predictions. Variable screening is a crucial first step in achieving these goals. This work conducts a selective review of feature screening procedures for survival data with ultrahigh dimensional covariates. We present the main methodologies, along with the key conditions that ensure sure screening properties. The practical utility of these methods is examined via extensive simulations. We conclude the review with a discussion of future opportunities in this field.
MR Subject Classification: 97K80

Keywords: survival analysis; ultrahigh dimensional predictors; variable screening; sure screening property

§1 Introduction

Modern biomedical studies have generated abundant survival data with high dimensional biomarkers for various scientific purposes. For instance, identifying genomic profiles that are associated with cancer patients' survival may help with understanding disease progression processes and designing more effective gene therapies. With the advent of new biotechnologies, the emergence of high-throughput data, such as gene expressions, SNPs, methylation and next-generation RNA sequencing, has pushed the dimensionality of data to a larger scale. In these cases, the dimensionality of the covariates may grow exponentially with the sample size, and such data have been commonly referred to as ultrahigh dimensional data ([5]).

When the number of covariates (p) is less than the sample size (n), parametric regression models, such as Weibull models, and semiparametric regression models, such as the Cox proportional hazards model and the accelerated failure time (AFT) model, have been routinely used for modeling censored outcome data in many practical settings. When p > n, penalized regression methods have been developed ([20], [4], [25], [29]), and the oracle properties and statistical error bounds of the resulting estimators have been established ([13], [15]). However, when p ≫ n, computational issues inherent in these methods make them inapplicable to ultrahigh dimensional statistical learning problems because of serious challenges in "computational expediency, statistical accuracy, and algorithmic stability" ([6]). A recent work by [2] did establish the oracle properties of regularized partial likelihood estimates under an ultrahigh dimensional setting. The results, however, required the optimizer of the penalized partial likelihood function to be unique and global, which is, in general, difficult to verify, especially when the dimension of the covariates is exceedingly high.
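To fix ideas, the penalized partial likelihood approach mentioned above minimizes an objective of the following generic form (a standard formulation for the Cox model with right-censored data, not the specific criterion of any one cited paper); here $\tilde{T}_i$ denotes the observed time, $\delta_i$ the event indicator, $x_i$ the covariate vector for subject $i$, and $p_\lambda(\cdot)$ a penalty function such as the lasso or SCAD:

$$
\hat{\beta} = \arg\min_{\beta \in \mathbb{R}^p} \Biggl\{ -\frac{1}{n} \sum_{i=1}^{n} \delta_i \Bigl[ x_i^\top \beta - \log \sum_{j:\, \tilde{T}_j \ge \tilde{T}_i} \exp\bigl(x_j^\top \beta\bigr) \Bigr] + \sum_{k=1}^{p} p_\lambda\bigl(|\beta_k|\bigr) \Biggr\}.
$$

When p ≫ n, this optimization is both computationally demanding and difficult to analyze without strong conditions, which motivates the screening-first strategy described next.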
A seminal paper by [5] demonstrated a simple but useful way to deal with ultrahigh dimensional regression. First, a variable screening procedure is used as a fast and crude tool for reducing the dimensionality to a moderate size (usually below the sample size). In the second step, a more sophisticated technique, such as penalized likelihood methods, can then be applied to the covariates retained by the screening step.
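As a concrete illustration of this two-step strategy, the sketch below ranks covariates by their marginal Cox score (log-rank type) statistic and retains the top d = ⌊n / log n⌋ of them. This particular utility and threshold are only one common choice among the many screening procedures reviewed later; the function names (marginal_cox_score, screen_covariates) and the synthetic data are illustrative rather than taken from any cited method.

```python
import numpy as np

def marginal_cox_score(x, time, event):
    """Cox score test statistic U^2 / V for one covariate, evaluated at beta = 0."""
    order = np.argsort(time)                  # process subjects in time order
    x, event = x[order], event[order]
    U = V = 0.0
    for i in range(len(x)):
        if event[i]:                          # only observed events contribute
            risk = x[i:]                      # risk set: subjects with time >= time[i]
            xbar = risk.mean()
            U += x[i] - xbar                  # score contribution at this event time
            V += ((risk - xbar) ** 2).mean()  # risk-set variance contribution
    return U * U / V if V > 0 else 0.0

def screen_covariates(X, time, event):
    """Step 1: rank covariates by marginal utility and keep the top floor(n / log n)."""
    n, p = X.shape
    scores = np.array([marginal_cox_score(X[:, j], time, event) for j in range(p)])
    d = int(np.floor(n / np.log(n)))          # a commonly used hard threshold
    keep = np.argsort(scores)[::-1][:d]       # indices of the retained covariates
    return keep, scores

# Example usage on synthetic right-censored data; step 2 (a penalized Cox or
# AFT fit on X[:, keep]) would follow with any standard penalized routine.
rng = np.random.default_rng(0)
n, p = 200, 2000
X = rng.standard_normal((n, p))
T = rng.exponential(scale=np.exp(-(0.8 * X[:, 0] - 0.6 * X[:, 1])))  # latent event times
C = rng.exponential(scale=2.0, size=n)                               # independent censoring times
time, event = np.minimum(T, C), (T <= C).astype(int)
keep, scores = screen_covariates(X, time, event)
```

Each covariate is scored independently with a closed-form statistic at β = 0, so the screening step parallelizes trivially across covariates and avoids any joint p-dimensional optimization.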