Thus far, genome-wide association studies (GWAS) have been disappointing: investigators have been unable to use the identified, statistically significant variants in complex diseases to make predictions useful for personalized medicine. Why do significant variables not lead to good prediction of outcomes? We point out that this problem is prevalent in simple as well as complex data, in the sciences as well as the social sciences. We offer a brief explanation and some statistical insight into why higher significance does not automatically imply stronger predictivity, and we illustrate both through simulations and a real breast cancer example. We also demonstrate that highly predictive variables need not appear highly significant, and can thus evade a researcher using significance-based methods. What makes a variable good for prediction, as opposed to significance, depends on different properties of the underlying distributions. If prediction is the goal, significance must be laid aside as the sole selection standard. We suggest that progress in prediction requires a new research agenda: the search for criteria that retrieve highly predictive variables rather than highly significant ones. We offer an alternative approach that was not designed around significance, the partition retention method, which proved very effective in prediction on a long-studied breast cancer data set, reducing the classification error rate from 30% to 8%.
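The abstract's central claim, that a variable can be overwhelmingly significant yet nearly useless for prediction, is easy to reproduce. The sketch below is a minimal, hypothetical illustration and not the paper's own simulation design; the sample size, effect size, and midpoint thresholding rule are all assumptions chosen for the demonstration. With a large sample, a tiny mean shift yields a vanishing p-value while classification accuracy stays near chance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical illustration, not the paper's simulation: a tiny effect size
# becomes "highly significant" at large n yet barely helps prediction.
n = 100_000
y = rng.integers(0, 2, size=n)               # balanced binary outcome
x = rng.normal(loc=0.05 * y, scale=1.0)      # minuscule mean shift between classes

# Significance: two-sample t-test comparing x across the outcome groups.
t_stat, p_value = stats.ttest_ind(x[y == 1], x[y == 0])
print(f"p-value: {p_value:.2e}")             # vanishingly small at this n

# Predictivity: classify by thresholding x midway between the class means.
threshold = 0.5 * (x[y == 1].mean() + x[y == 0].mean())
accuracy = np.mean((x > threshold) == (y == 1))
print(f"classification accuracy: {accuracy:.3f}")  # barely above chance (0.5)
```

Significance here grows with the sample size, while predictivity is bounded by the overlap of the two class distributions, which is exactly the distinction between distributional properties that the abstract draws.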
A trend in all scientific disciplines, driven by advances in technology, is the increasing availability of high dimensional data in which important information lies buried. An urgent current challenge for statisticians is to develop effective methods of extracting the useful information from the vast amounts of messy and noisy data available, most of which are noninformative. This paper presents a general computer intensive approach, based on a method pioneered by Lo and Zheng, for detecting which of many potential explanatory variables have an influence on a dependent variable $Y$. The approach is suited to detecting influential variables whose causal effects depend on the confluence of values of several variables. It has the advantage of avoiding a difficult direct analysis involving possibly thousands of variables: it deals with many randomly selected small subsets, from which smaller subsets are selected, guided by a measure of influence $I$. The main objective is to discover the influential variables rather than to measure their effects. Once they are detected, the problem of dealing with a much smaller group of influential variables should be amenable to appropriate analysis.
In a sense, we are confining our attention to locating a few needles in a haystack.

Published in the Annals of Applied Statistics (http://dx.doi.org/10.1214/09-AOAS265) by the Institute of Mathematical Statistics (http://www.imstat.org/aoas/).
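The abstract does not specify the influence measure $I$. One form that appears in the partition-based variable selection literature is an I-score of the form $I = \sum_j n_j^2(\bar{Y}_j - \bar{Y})^2$, summed over the cells of the partition induced by a candidate variable subset. The sketch below outlines that idea under simplifying assumptions: binary explanatory variables, a greedy backward-dropping refinement of each random subset, and no score normalization. The names i_score and backward_drop, the toy data, and the sampling schedule are illustrative, not the authors' exact algorithm.

```python
import numpy as np

def i_score(X_sub, y):
    """One assumed form of the influence score: partition observations into
    cells by the joint values of the columns of X_sub, then reward cells whose
    mean response deviates from the overall mean, weighted by squared cell size."""
    y_bar = y.mean()
    cells = {}
    for row, yi in zip(map(tuple, X_sub), y):
        cells.setdefault(row, []).append(yi)
    return sum(len(v) ** 2 * (np.mean(v) - y_bar) ** 2 for v in cells.values())

def backward_drop(X, y, subset):
    """Greedily drop one variable at a time, keeping the drop that raises the
    score most; return the best-scoring subset seen along the path."""
    subset = list(subset)
    best_subset, best_score = subset[:], i_score(X[:, subset], y)
    while len(subset) > 1:
        score, drop = max(
            (i_score(X[:, [v for v in subset if v != d]], y), d) for d in subset
        )
        subset.remove(drop)
        if score > best_score:
            best_score, best_subset = score, subset[:]
    return best_subset, best_score

# Toy run: y depends only on the interaction of variables 0 and 1.
rng = np.random.default_rng(1)
n, p = 500, 20
X = rng.integers(0, 2, size=(n, p))          # assumed binary predictors
y = (X[:, 0] ^ X[:, 1]).astype(float)        # pure interaction, no marginal effect

hits = {}
for _ in range(200):                          # many random small starting subsets
    start = rng.choice(p, size=5, replace=False)
    sub, score = backward_drop(X, y, start)
    key = tuple(sorted(int(v) for v in sub))
    hits[key] = max(hits.get(key, 0.0), score)

print(sorted(hits.items(), key=lambda kv: -kv[1])[:3])  # (0, 1) should rank first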